Re: Accidentally cloned Superset to druid-io

2023-05-18 Thread Charles Allen
Thank you, Xavier!

On Thu, May 18, 2023 at 8:47 AM Xavier Léauté 
wrote:

> No worries, I can take care of it.
>
> On Thu, May 18, 2023 at 7:45 AM Charles Allen  wrote:
>
> > https://github.com/druid-io/superset
> >
> > I clicked too fast and didn't realize cloning went to druid-io instead of
> > my own stuff. Sorry about that! I'm going to work with the repo admins to
> > rectify and minimize the chance of it happening again.
> >
> > Cheers,
> > Charles Allen
> >
>


Accidentally cloned Superset to druid-io

2023-05-18 Thread Charles Allen
https://github.com/druid-io/superset

I clicked too fast and didn't realize cloning went to druid-io instead of
my own stuff. Sorry about that! I'm going to work with the repo admins to
rectify and minimize the chance of it happening again.

Cheers,
Charles Allen


Pinot tiered storage

2020-03-17 Thread Charles Allen
https://docs.google.com/document/d/1Z4FLg3ezHpqvc6zhy0jR6Wi2OL8wLO_lRC6aLkskFgs/edit

FYI, this proposal just came across the incubator mailing list.


Re: Druid and machine learning

2020-01-28 Thread Charles Allen
> One corner case is sketches which are time series, so models could be
applied to them individually.

Or if there were a case for composable models that have some sort of
intermediate stage. I don't know of any models with intermediate stages
that are associative and commutative, but if there were, it might be a case
for quickly deriving new models by combining intermediate stages in an
ad-hoc fashion.
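For what it's worth, ordinary least squares comes close to this idea: its sufficient statistics (XᵀX, Xᵀy) merge by plain addition, which is associative and commutative, so partial "models" from different data shards combine in any order. A hypothetical Python sketch, with all names invented for illustration (none of this is Druid code):

```python
# Sketch of a "composable model" with an associative, commutative
# intermediate stage: OLS sufficient statistics merge by addition.
import numpy as np

def partial_fit(X, y):
    """Compute the mergeable intermediate stage for one data shard."""
    return X.T @ X, X.T @ y

def merge(a, b):
    """Combine two intermediate stages; order and grouping don't matter."""
    return a[0] + b[0], a[1] + b[1]

def finalize(stage):
    """Derive the model (OLS coefficients) from a merged stage."""
    xtx, xty = stage
    return np.linalg.solve(xtx, xty)

# Two shards of data drawn from y = 2*x
X1, y1 = np.array([[1.0], [2.0]]), np.array([2.0, 4.0])
X2, y2 = np.array([[3.0], [4.0]]), np.array([6.0, 8.0])

s1, s2 = partial_fit(X1, y1), partial_fit(X2, y2)
coef_merged = finalize(merge(s1, s2))
coef_full = finalize(partial_fit(np.vstack([X1, X2]), np.concatenate([y1, y2])))
assert np.allclose(coef_merged, coef_full)  # merging shards == fitting all data
```

The merge step is exactly the shape Druid aggregators already have (combine of partials), which is what would make ad-hoc recombination cheap.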


On Tue, Jan 28, 2020 at 12:39 AM Roman Leventov 
wrote:

> However, I now see Charles' point -- the data which is typically stored
> in Druid rows is simple and is not something models are typically applied
> to. Timeseries themselves (that is, the results of timeseries queries in
> Druid) may be an input for anomaly detection or phase transition models,
> but there is no point in applying them inside Druid.
>
> One corner case is sketches which are time series, so models could be
> applied to them individually.
>
> On Tue, 28 Jan 2020 at 08:59, Roman Leventov 
> wrote:
>
> > I was thinking about model training at Druid indexing side and evaluation
> > at Druid querying side.
> >
> > The advantage Druid has over Spark at querying is faster row filtering
> > thanks to bitset indexes. But since model evaluation is a pretty heavy
> > operation (I suppose; does anyone have ballpark time estimates? how does
> it
> > compare to Sketch update?) then row scanning may not be the bottleneck
> and
> > therefore no significant reason to use Druid instead of just plugging
> Spark
> > engine to Druid segments.
> >
> > At indexing side, Druid indexer may be considered a general-purpose job
> > scheduler so that somebody who already has Druid may leverage it instead
> of
> > setting up a separate Airflow scheduler.
> >
> > On Tue, 28 Jan 2020, 06:46 Charles Allen,  wrote:
> >
> >> >  it makes more sense to have tooling around Druid, to do slice and
> dice
> >> the data that you need, and do the ml stuff in sklearn, or even in spark
> >>
> >> I agree with this sentiment. Druid as an execution engine is very good
> at
> >> doing distributed aggregation (distributed reduce). What advantage does
> >> Druid as an engine have that Spark does not for ML?
> >>
> >> Are you talking training or model evaluation? or any?
> >>
> >> It *might* be possible to have a likeness mechanism, whereby you can
> pass
> >> in a model as a filter and aggregate on rows (dimension tuples?) that
> >> match
> >> the model by some minimum criteria, but I'm not really sure what utility
> >> that would be. Maybe as a quick backtesting engine? I feel like I'm a
> >> solution searching for a problem going down this route though.
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Mon, Jan 27, 2020 at 12:11 AM Driesprong, Fokko  >
> >> wrote:
> >>
> >> > > Vertica has it. Good idea to introduce it in Druid.
> >> >
> >> > I'm not sure if this is a valid argument. With this argument, you can
> >> > introduce anything into Druid. I think it is good to be opinionated,
> >> and as
> >> > a community why we do or don't introduce ML possibilities into the
> >> > software.
> >> >
> >> > For example, databases like Postgres and BigQuery allow users to do
> >> simple
> >> > regression models:
> >> > https://cloud.google.com/bigquery-ml/docs/bigqueryml-intro. I also
> >> don't
> >> > think it is that hard to introduce linear regression using gradient
> >> > descent into Druid:
> >> >
> >> >
> >>
> https://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression/
> >> > However,
> >> > how many people are going to use this?
> >> >
> >> > For me, it makes more sense to have tooling around Druid, to do slice
> >> and
> >> > dice the data that you need, and do the ml stuff in sklearn, or even
> in
> >> > spark. For example using https://github.com/druid-io/pydruid or
> having
> >> the
> >> > ability to use Spark to read directly from the deep storage.
> >> >
> >> > Introducing models using SP or UDF's is also a possibility, but here I
> >> > share the concerns of Sayat when it comes to performance and
> >> scalability.
> >> >
> >> > Cheers, Fokko
> >> >
> >> >
> >> >
> >> > Op za 25 jan. 2020 om 08:51 schreef Gaurav Bhatnagar <
> >> gaura...@gmail.c

Re: Druid and machine learning

2020-01-28 Thread Charles Allen
Having a smart segment balancer in the coordinator that used a "segment
work" based distribution model would be awesome. Matching the work a
segment is likely to induce against a historical node's capacity to do
work, all in a dynamic way... you wouldn't even need retention periods
anymore! Especially cool if the system could fall back to pulling from deep
storage ad hoc.
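A hedged sketch of what "segment work" based placement might look like: greedily give each segment to the node with the lowest current utilization, weighting by per-node capacity. Names, work estimates, and capacities are all invented for illustration; the real coordinator logic would be far richer:

```python
# Greedy work-based segment placement sketch (illustrative only).
import heapq

def balance(segments, node_capacities):
    """segments: {name: estimated_work}; node_capacities: {node: capacity}.
    Returns {node: [segment names]}, greedily minimizing max utilization."""
    # Min-heap keyed on utilization (load / capacity), so heterogeneous
    # nodes with more capacity naturally absorb more work.
    heap = [(0.0, node) for node in node_capacities]
    heapq.heapify(heap)
    assignment = {node: [] for node in node_capacities}
    # Place the largest segments first -- the classic greedy heuristic
    # for balanced partitioning.
    for name, work in sorted(segments.items(), key=lambda kv: -kv[1]):
        util, node = heapq.heappop(heap)
        assignment[node].append(name)
        heapq.heappush(heap, (util + work / node_capacities[node], node))
    return assignment

segs = {"s1": 10, "s2": 8, "s3": 3, "s4": 3}
nodes = {"big": 2.0, "small": 1.0}
result = balance(segs, nodes)
```

The "work" estimate is the interesting part in practice (segment size, query rate, scan cost); the placement loop itself is simple once that number exists.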



On Tue, Jan 28, 2020 at 3:28 PM Jonathan Wei  wrote:

> I'm not that familiar with machine learning, but is there potential value
> in having Druid be a "consumer" of machine learning, such as for
> optimization purposes?
>
> For example, training a Druid cluster on past queries as part of a query
> cost estimator.
>
>
>
> On Tue, Jan 28, 2020 at 12:39 AM Roman Leventov 
> wrote:
>
> > However, I now see Charles' point -- the data which is typically
> stored
> > in Druid rows is simple and is not something models are typically applied
> > to. Timeseries themselves (that is, the results of timeseries queries in
> > Druid) may be an input for anomaly detection or phase transition models,
> > but there is no point in applying them inside Druid.
> >
> > One corner case is sketches which are time series, so models could be
> > applied to them individually.
> >
> > On Tue, 28 Jan 2020 at 08:59, Roman Leventov 
> > wrote:
> >
> > > I was thinking about model training at Druid indexing side and
> evaluation
> > > at Druid querying side.
> > >
> > > The advantage Druid has over Spark at querying is faster row filtering
> > > thanks to bitset indexes. But since model evaluation is a pretty heavy
> > > operation (I suppose; does anyone have ballpark time estimates? how does
> > it
> > > compare to Sketch update?) then row scanning may not be the bottleneck
> > and
> > > therefore no significant reason to use Druid instead of just plugging
> > Spark
> > > engine to Druid segments.
> > >
> > > At indexing side, Druid indexer may be considered a general-purpose job
> > > scheduler so that somebody who already has Druid may leverage it
> instead
> > of
> > > setting up a separate Airflow scheduler.
> > >
> > > On Tue, 28 Jan 2020, 06:46 Charles Allen,  wrote:
> > >
> > >> >  it makes more sense to have tooling around Druid, to do slice and
> > dice
> > >> the data that you need, and do the ml stuff in sklearn, or even in
> spark
> > >>
> > >> I agree with this sentiment. Druid as an execution engine is very good
> > at
> > >> doing distributed aggregation (distributed reduce). What advantage
> does
> > >> Druid as an engine have that Spark does not for ML?
> > >>
> > >> Are you talking training or model evaluation? or any?
> > >>
> > >> It *might* be possible to have a likeness mechanism, whereby you can
> > pass
> > >> in a model as a filter and aggregate on rows (dimension tuples?) that
> > >> match
> > >> the model by some minimum criteria, but I'm not really sure what
> utility
> > >> that would be. Maybe as a quick backtesting engine? I feel like I'm a
> > >> solution searching for a problem going down this route though.
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Mon, Jan 27, 2020 at 12:11 AM Driesprong, Fokko
>  > >
> > >> wrote:
> > >>
> > >> > > Vertica has it. Good idea to introduce it in Druid.
> > >> >
> > >> > I'm not sure if this is a valid argument. With this argument, you
> can
> > >> > introduce anything into Druid. I think it is good to be opinionated,
> > >> and as
> > >> > a community why we do or don't introduce ML possibilities into the
> > >> > software.
> > >> >
> > >> > For example, databases like Postgres and BigQuery allow users to do
> > >> simple
> > >> > regression models:
> > >> > https://cloud.google.com/bigquery-ml/docs/bigqueryml-intro. I also
> > >> don't
> > >> > think it is that hard to introduce linear regression using
> gradient
> > >> > descent into Druid:
> > >> >
> > >> >
> > >>
> >
> https://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression/
> > >> > However,
> > >> > how many people are going to use this?
> > >> >

Re: Druid and machine learning

2020-01-27 Thread Charles Allen
>  it makes more sense to have tooling around Druid, to do slice and dice
the data that you need, and do the ml stuff in sklearn, or even in spark

I agree with this sentiment. Druid as an execution engine is very good at
doing distributed aggregation (distributed reduce). What advantage does
Druid as an engine have that Spark does not for ML?

Are you talking training or model evaluation? or any?

It *might* be possible to have a likeness mechanism, whereby you can pass
in a model as a filter and aggregate on rows (dimension tuples?) that match
the model by some minimum criteria, but I'm not really sure what utility
that would be. Maybe as a quick backtesting engine? I feel like I'm a
solution searching for a problem going down this route though.






On Mon, Jan 27, 2020 at 12:11 AM Driesprong, Fokko 
wrote:

> > Vertica has it. Good idea to introduce it in Druid.
>
> I'm not sure if this is a valid argument. With this argument, you can
> introduce anything into Druid. I think it is good to be opinionated, and as
> a community why we do or don't introduce ML possibilities into the
> software.
>
> For example, databases like Postgres and BigQuery allow users to do simple
> regression models:
> https://cloud.google.com/bigquery-ml/docs/bigqueryml-intro. I also don't
> think it is that hard to introduce linear regression using gradient
> descent into Druid:
>
> https://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression/
> However,
> how many people are going to use this?
>
> For me, it makes more sense to have tooling around Druid, to do slice and
> dice the data that you need, and do the ml stuff in sklearn, or even in
> spark. For example using https://github.com/druid-io/pydruid or having the
> ability to use Spark to read directly from the deep storage.
>
> Introducing models using SP or UDF's is also a possibility, but here I
> share the concerns of Sayat when it comes to performance and scalability.
>
> Cheers, Fokko
>
>
>
> Op za 25 jan. 2020 om 08:51 schreef Gaurav Bhatnagar :
>
> > +1
> >
> > Vertica has it. Good idea to introduce it in Druid.
> >
> > On Mon, Jan 13, 2020 at 12:52 AM Dusan Maric  wrote:
> >
> > > +1
> > >
> > > That would be a great idea! Thanks for sharing this.
> > >
> > > Would just like to chime in on Druid + ML model cases: predictions and
> > > anomaly detection on top of TensorFlow ❤
> > >
> > > Regards,
> > >
> > > On Fri, Jan 10, 2020 at 6:41 AM Roman Leventov 
> > > wrote:
> > >
> > > > Hello Druid developers, what do you think about the future of Druid &
> > > > machine learning?
> > > >
> > > > Druid has been great at complex aggregations. Could (should?) It make
> > > > inroads into ML? Perhaps aggregators which apply the rows against
> some
> > > > pre-trained model and summarize results.
> > > >
> > > > Should model training stay completely external to Druid, or it could
> be
> > > > incorporated into Druid's data lifecycle on a conceptual level, such
> > as a
> > > > recurring "indexing" task which stores the result (the model) in
> > Druid's
> > > > deep storage, the model automatically loaded on historical nodes as
> > > needed
> > > > (just like segments) and certain aggregators pick up the latest
> model?
> > > >
> > > > Does this make any sense? In what cases Druid & ML will and will not
> > work
> > > > well together, and ML should stay a Spark's prerogative?
> > > >
> > > > I would be very interested to hear any thoughts on the topic, vague
> > ideas
> > > > and questions.
> > > >
> > >
> > >
> > > --
> > > Dušan Marić
> > > mob.: +381 64 1124779 | e-mail: thema...@gmail.com | skype: themaric
> > >
> >
>
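For reference, the gradient-descent linear regression in the article linked above boils down to a few lines. This is a minimal illustrative sketch of the technique, not something anyone in the thread is proposing to ship inside Druid:

```python
# Minimal gradient-descent linear regression (y = m*x + b), in the spirit
# of the linked Atomic Object article. Illustrative only.
def fit(xs, ys, lr=0.01, epochs=5000):
    m = b = 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of mean squared error w.r.t. slope and intercept.
        grad_m = sum(2 * (m * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (m * x + b - y) for x, y in zip(xs, ys)) / n
        m -= lr * grad_m
        b -= lr * grad_b
    return m, b

m, b = fit([1, 2, 3, 4], [3, 5, 7, 9])  # data drawn from y = 2x + 1
```

Which rather supports Fokko's point: the algorithm is trivial; the open question is whether anyone would use it from inside the database.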


Re: stop() method for extensions module

2019-12-04 Thread Charles Allen
I had problems with this as well. Check out
https://github.com/apache/incubator-druid/pull/6798 for some ways to handle
this.

On Wed, Dec 4, 2019 at 9:22 AM Krishna Likhitha Katakam <
krishna.likhi...@phonepe.com> wrote:

> Hi,
>
> I have a basic question:
> When we write a custom Druid extensions module, if we have some
> resources (like an HTTP client with non-daemon threads) created as part of
> the module, there is no close() method currently where we can safely clean
> up the resources. Could anyone help how to do this?
>
> Our use-case:
> We have an extension used as part of Kafka Supervisor. We need to do some
> clean up for the Peon process to stop successfully. Else, the peon process
> is stuck after SUCCESS state.
>


Re: Discussion: Moving DataSketches to core

2019-10-31 Thread Charles Allen
Any time we discuss moving things into core Druid I would love to see a
list of dependencies that comes with it.

On Wed, Oct 30, 2019, 6:08 PM Jihoon Son  wrote:

> +1 on moving too.
>
> On Mon, Oct 28, 2019 at 12:46 PM Fangjin Yang  wrote:
>
> > +1 on moving datasketches to core
> >
> > On Mon, Oct 28, 2019 at 12:36 PM Chi Cao Minh 
> > wrote:
> >
> > > To support range partitioning for native parallel batch indexing, I’m
> > > considering moving DataSketches from extensions to core (see
> > > https://github.com/apache/incubator-druid/issues/8769 for details).
> > > Having DataSketches in core would also allow us to switch usages of
> > > HyperLogLogCollector to the better HLL implementation available in
> > > DataSketches. One drawback is that moving DataSketches to core will
> > > possibly block the work to upgrade DataSketches to the latest version:
> > > https://github.com/apache/incubator-druid/pull/8647.
> > >
> > > Any other thoughts on the pros/cons?
> > >
> > > Thanks,
> > > Chi
> >
>


Re: Graduation

2019-06-13 Thread Charles Allen
+1



On Fri, Jun 7, 2019, 9:27 PM Gian Merlino  wrote:

> Hey Druids,
>
> Druid has been in the incubator for a while, and we have done 4 releases so
> far (0.13.0, 0.14.0, 0.14.1, and 0.14.2) with a 5th on the way. There has
> been some discussion off-list recently about pushing for graduation and it
> was pointed out that it is way past time to have a discussion about
> graduation readiness on-list. So the topic of discussion for this thread
> is: are we ready to graduate?
>
> Here are some links I'm aware of that describe what a podling needs to do
> to be able to graduate.
>
> 1) http://incubator.apache.org/projects/druid.html
> 2) https://incubator.apache.org/guides/graduation.html
> 3)
>
> https://incubator.apache.org/policy/incubation.html#graduating_from_the_incubator
> 4) https://whimsy.apache.org/pods/project/druid
>
> We have done a lot of the hard stuff already. I think in terms of community
> robustness and adherence to the Apache Way, we were there before we even
> got into the incubator. Known remaining items (known by me, at least):
>
> - Website migration from http://druid.io/ to https://druid.apache.org/.
> Current status: full details are in the "proposed website migration
> thread", but TLDR is that site migration is almost complete, hopefully
> within days of being done.
>
> - Website content update to match (1) above: not sure if it's being worked
> on, but shouldn't take long. Contribs welcome.
>
> I think we are good on other stuff, but I might have missed something so
> please chime in anyone / everyone. Here's looking forward to graduation!!
>


Re: Druid community weekly meetings

2019-05-21 Thread Charles Allen
There was a discussion on this forum a few weeks ago where it was mentioned
we would cancel the weekly meetings.

The chief driver was that the dev list was providing sufficient forum for
sync ups and is way more in sync with the Apache Way.

If you have specific concerns about dropping the weekly meeting, can you
please call them out?

Cheers,
Charles Allen


On Tue, May 21, 2019 at 10:14 AM Anastasia Braginsky
 wrote:

> Hi Everyone,
> Are there no weekly meetings anymore? I recall there were on Tuesdays...
> Has it been moved to some other time?
> Thanks, Anastasia
>


Approx N-tile and complex object return values

2019-04-23 Thread Charles Allen
Hi all!

If you do not use approximate quantiles (or histograms or quantiles
double sketch) then you can stop reading.

https://github.com/apache/incubator-druid/issues/7486 brings up an
issue related to how objects are returned from Druid aggregations,
specifically when the aggregation has a complex input configuration
(like an array of input values). I'm bringing the discussion to the
dev list so that any decisions are part of a more official Apache
review process and not accidentally tucked away in a GitHub thread.
Please be sure to check out the thread and AlexanderSaydakov's
insights.

>  If a single quantile is requested, then the best answer must be NaN, not 
> zero since zero is a perfectly good number and would be deeply misleading. 
> What to do if an array of quantiles is requested?

I'm inclined to say the expected data shape returned should be preserved.

Let's say there's an alternate world where some other quantile
estimation algorithm can either converge or not converge depending on
the percentile you requested. Choosing the 50th percentile might
converge and give you a value, but choosing the 99.99th might not.
In such a world it would be possible for SOME of the requested values
to resolve but not others. In this same world, if you were to run two
aggregators at `50%` and `99.99%`, vs one aggregator at `[50%,
99.99%]`, I would hope the results would be directly relatable, and
that the array form would be one of optimization or convenience.

As such, and since
`org.apache.druid.query.aggregation.histogram.ApproximateHistogram`
already sets a precedent for returning an array of `NaN`, I propose
the returned value for an array of quantiles be directly translatable
to the array-equivalent form of the result when requesting the
quantiles singularly in different aggregations. Which in this case I
believe would be an array of `NaN`.
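A toy sketch of the proposed invariant, with hypothetical functions (this is not the DataSketches API): the array form of a quantile request should be elementwise identical to issuing the requests singularly, with `NaN` filling any slot that cannot be answered:

```python
# Sketch of the invariant: quantiles(data, [a, b]) == [quantile(data, a),
# quantile(data, b)], with NaN for unanswerable entries. Hypothetical API.
import math

def quantile(sorted_values, fraction):
    """Nearest-rank quantile; NaN when there is no data to answer from."""
    if not sorted_values:
        return math.nan
    idx = min(int(fraction * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]

def quantiles(sorted_values, fractions):
    """Array form is just the element-wise singular form -- the invariant."""
    return [quantile(sorted_values, f) for f in fractions]

data = [1.0, 2.0, 3.0, 4.0]
assert quantiles(data, [0.5, 0.9999]) == [quantile(data, 0.5), quantile(data, 0.9999)]
# Empty input: an array of NaN, matching the singular NaN answer.
assert all(math.isnan(v) for v in quantiles([], [0.5, 0.9999]))
```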

Thoughts?
Charles Allen

-
To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
For additional commands, e-mail: dev-h...@druid.apache.org



Re: decompression time metric, ie `query/decompress/time`

2019-04-19 Thread Charles Allen
We use the gcp profiler to get this kind of info.

I'm torn, because on one hand it is handy to be able to have such
information in the metrics system. But on the other hand you are kind
of adding in extra profiling stuff when there are other profiling
tools out there.

As another example would we ever add in page cache metrics for Druid?

On Fri, Apr 19, 2019 at 3:39 PM Egor Ryashin
 wrote:
>
> Hey,
>
> I wonder if we add a metric which shows query column decompression time,
> does it make sense? When a segment is loaded by the Historical node into
> memory the Historical has to decompress columns before filtering and
> aggregation, and I don't have any information about how much the
> decompression time contributes to the total query time.
>
> Thanks,
> Egor




Re: JDK 11 support

2019-04-03 Thread Charles Allen
https://github.com/apache/incubator-druid/issues/5589 -- the desire is
there, but it needs some footwork. There are some enterprises that cannot
migrate Hadoop deployments to Java 11, so how long Java 8 would remain
supported would have to be sorted out. I think trying to keep up with JDK
long-term support (LTS) release compatibility would make sense.

On Wed, Apr 3, 2019 at 6:44 AM Anoop K  wrote:

> Hi Team,
>
> JDK 8 is nearing EOL and Druid does not run on JDK 11. Has there been any
> discussion about supporting JDK 11? The Druid code base uses classes from
> the sun.misc package which are removed in JDK 11.
> What is the preferred migration path?
>
>   *   Build druid with JDK 11 as source version and target version.
>   *   Build druid on JDK 8 as source version and JDK 11 as target version.
> Going with this approach, desired functionality has to be implemented using
> reflection.
>
> I prefer moving completely to JDK 11 and not support JDK 8. Thoughts
> please. Would be happy to contribute to this activity.
>
> -anoop
>


Re: Data Types

2019-03-29 Thread Charles Allen
For my team we start from the other direction. What are people DOING with
the data. For example, if they are doing counts and sums with basic
predicates, then in what ways does the existing feature set not meet those
needs?

If they are doing other things, what is the end result they are trying to
achieve?

Can you provide more context on the end use cases?

On Fri, Mar 29, 2019, 12:24 AM zeng jienan  wrote:

> Hi,
>
> Will druid support other data types in the future? Such as boolean, byte,
> short, int.
>


Re: dns lookups cached for kafka?

2019-03-21 Thread Charles Allen
I believe so. That's what I do.

On Thu, Mar 21, 2019 at 8:52 AM Don Bowman  wrote:

> On Thu, 21 Mar 2019 at 11:48, Charles Allen  .invalid>
> wrote:
>
> > Druid assumes the network layer handles whatever tuning is needed
> regarding
> > DNS resolution or IP routing. In general this means making sure you have
> > your java settings correct (see
> >
> >
> https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/java-dg-jvm-ttl.html
> > for a related article).
> >
> >
> >
> Thanks.
>
> Since druid does not call `java.security.Security.setProperty()`, I take it
> this means my only option is to globally change the JRE in
> $JAVA_HOME/jre/lib/security/java.security?
>


Re: dns lookups cached for kafka?

2019-03-21 Thread Charles Allen
Druid assumes the network layer handles whatever tuning is needed regarding
DNS resolution or IP routing. In general this means making sure you have
your java settings correct (see
https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/java-dg-jvm-ttl.html
for a related article).
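Concretely, the knob that article describes is the `networkaddress.cache.ttl` security property. It can be set in the JRE's `java.security` file, programmatically via `java.security.Security.setProperty(...)` early in startup, or (per process) through the legacy `sun.net.inetaddr.ttl` system property. The 60-second values below are only an example, not a recommendation:

```
# $JAVA_HOME/jre/lib/security/java.security
# Cache successful DNS lookups for 60 seconds instead of potentially forever:
networkaddress.cache.ttl=60
# Optionally bound caching of failed lookups as well:
networkaddress.cache.negative.ttl=10

# Per-process alternative (legacy Sun-internal system property):
#   java -Dsun.net.inetaddr.ttl=60 ...
```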

On Thu, Mar 21, 2019 at 7:14 AM Don Bowman  wrote:

> What is the expectation around dns and druid? Specifically, when overlord
> started, it resolved (correctly) my kafka cluster nodes.
> A little bit later I made a change which changed their IP. But overlord
> continues to use that originally resolved IP.
>
> Is there a way to force the refresh?
> Should it be re-resolving on a connection failure?
>
> [Consumer clientId=consumer-6, groupId=kafka-supervisor-bebmfiod]
> Connection to node 2 (kafka-2.kafka-headless.kafka/*10.60.8.24*:9092) could
> not be established. Broker may not be available.
>
> $ kubectl -n kafka get pods -o=wide
> NAME                                 READY  STATUS   RESTARTS  AGE  IP            NODE
> kafka-0                              1/1    Running  0         20m  10.60.7.45    gke-noctest-default-pool-51f579ce-ztgf
> kafka-1                              1/1    Running  0         20m  10.60.1.48    gke-noctest-default-pool-8674014d-p008
> kafka-2                              1/1    Running  0         20m  *10.60.8.48*  gke-noctest-default-pool-3aa530af-cztp
> kafka-health-check-5d5b457566-2k8lf  1/1    Running  0         20m  10.60.2.47    gke-noctest-default-pool-8674014d-rxbn
>


Re: Proposed website migration plan

2019-03-12 Thread Charles Allen
Are there other projects that have transitioned an independently successful
domain name to an Apache one?

On Tue, Mar 5, 2019 at 2:13 PM David Lim  wrote:

> Who has control over the druid.io domain? Charles would that be you?
>
> We'd need support from them for the DNS redirect.
>
> On Tue, Mar 5, 2019 at 2:04 PM Jonathan Wei  wrote:
>
> > We still need to complete the website migration to Apache infrastructure.
> >
> > I'll propose the following plan:
> >
> > Proposed Apache Druid website migration plan
> > 
> >
> > These links have some previous discussion on the website migration:
> >
> >
> >
> https://lists.apache.org/thread.html/7cae100b684e0b33e0adda993efea3d6088978700988a0ae632fdd80@%3Cdev.druid.apache.org%3E
> >
> https://issues.apache.org/jira/browse/INFRA-17340
> >
> > From the discussions above, the recommendation is to have 2 separate
> repos
> > for the website: one for source and another for built content that will
> be
> > served.
> >
> > Generating site files
> > ===
> >
> > The Apache site update process will be similar to our current process.
> >
> > Current process:
> > 1. Push changes to
> https://github.com/druid-io/druid-io.github.io/tree/src
> > 2. metamx bot picks up changes, builds, and commits to
> > https://github.com/druid-io/druid-io.github.io/tree/master
> > 3. https://github.com/druid-io/druid-io.github.io/tree/master is served
> by
> > github pages
> >
> > Apache process:
> > 1. Push changes to https://github.com/apache/incubator-druid-website-src
> > 2. Jenkins bot from Apache will build the website from source repo,
> commit
> > to https://github.com/apache/incubator-druid-website
> > 3. Apache Druid website will be served from the content in
> > https://github.com/apache/incubator-druid-website (asf-site branch)
> >
> >
> > Hosting and SEO
> > 
> >
> > The Apache site will be hosted at druid.apache.org on Apache
> > infrastructure:
> http://www.apache.org/dev/project-site.html
> >
> > To preserve our search rankings, we can setup 301 redirects from the old
> > druid.io site to the corresponding pages on the druid.apache.org site. (
> >
> https://moz.com/learn/seo/redirection
> )
> >
> > However, Github pages (which currently hosts the druid.io site) does not
> > support 301 redirects, so we propose the following:
> > - Setup a new Nginx server that will perform 301 redirects to
> > druid.apache.org for the druid.io. Imply can host this if needed.
> > - Update the druid.io DNS entry to point to this new Nginx server
> > - Shut down Github pages hosting for druid.io
> >
> > In addition, we can also set canonical tags on our pages:
> >
> https://moz.com/learn/seo/canonicalization
> >
> >
> > Action items
> > ===
> > - Setup a Jenkins bot that builds the Apache website content from source
> > - Get the Apache website up
> > - Setup Nginx redirect server for druid.io
> > - Shutdown github pages and redirect DNS for druid.io to Nginx redirect
> > server
> > - Add canonical tags to pages
> >
>
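The redirect server in the plan above could be as small as a single Nginx server block. The certificate paths below are placeholders, and the domains are the ones named in the plan; this is a sketch, not the actual deployed config:

```
# Hypothetical minimal Nginx config for the druid.io -> druid.apache.org
# 301 redirects described in the migration plan.
server {
    listen 80;
    listen 443 ssl;
    server_name druid.io www.druid.io;

    ssl_certificate     /etc/nginx/certs/druid.io.crt;  # placeholder path
    ssl_certificate_key /etc/nginx/certs/druid.io.key;  # placeholder path

    # Permanent redirect, preserving the request path so per-page
    # search rankings carry over (the "301" part of the plan).
    return 301 https://druid.apache.org$request_uri;
}
```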


Re: docker build

2019-03-05 Thread Charles Allen
I would support that.

On Tue, Mar 5, 2019 at 11:13 AM David Glasser 
wrote:

> Is the following a reasonable solution from both a usability and
> legal perspective?
> - Write a Dockerfile that has everything except the GPL jar on it
> (including the Druid code that talks to the jar if configured to use MySQL)
> - Automatically publish that Docker image to the ASF account on DockerHub
> - Also include a short Dockerfile in the repo that starts `FROM` our
> auto-built account and has a line or two of wget to download the jar
> (similar to the wget currently in the Dockerfile)
> - Tell users who want to use MySQL that they must publish that extra
> layered image themselves
>
> --dave
>
>
> On Tue, Mar 5, 2019 at 9:59 AM Charles Allen  .invalid>
> wrote:
>
> > Honestly we're at a very strange impasse. On one hand I don't think the
> ASF
> > project can adopt an official docker image unless ASF legal says it's OK.
> > "Official" releases are source code anyways (as my understanding goes),
> and
> > binary artifacts are convenience things. Unfortunately I do not see a
> path
> > forward unless some entity is willing to take on a stance similar as
> > outlined in
> https://issues.apache.org/jira/browse/LEGAL-437 . This is
> . This is
> > pretty new territory from a legal perspective (the fact that docker
> images
> > are layers makes it even more interesting).
> >
> > At this point I think the safest thing to do is something that is "no
> more
> > GPL dependent than other containers in the apache repo", which would mean
> > not adding in GPL binaries. Which means switching to postgres. I don't
> > foresee an aggressive legal stance on this issue, meaning it might take a
> > while as people watch where the industry is going.
> >
> >
> >
> > On Tue, Mar 5, 2019 at 8:20 AM Don Bowman  wrote:
> >
> > > where do we stand on this?
> > > the PR is in and accepted, but i feel we need to have this built as
> part
> > of
> > > the release artifacts and on dockerhub to foster adoption.
> > > if the only issue is the mysql connector i can remove it in favour of
> the
> > > postgres connector.
> > >
> > >
> > > On Mon, 18 Feb 2019 at 13:58, Don Bowman  wrote:
> > >
> > > > i can just remove the mysql, the postgres works, i was just assuming
> > > folks
> > > > wanted it.
> > > >
> > > >
> > > > On Mon, 18 Feb 2019 at 16:58, Gian Merlino  wrote:
> > > >
> > > >> A discussion is progressing on
> > > >>
> > >
> >
> https://issues.apache.org/jira/browse/LEGAL-437
> > .
> > > It doesn't seem to have
> > > >> got anywhere firm yet.
> > > >>
> > > >> On Fri, Feb 8, 2019 at 12:23 PM Gian Merlino 
> wrote:
> > > >>
> > > >> > I don't think anything is strictly needed from you at this point,
> > but
> > > >> > things happen when people drive them, and participation in that
> > effort
> > > >> > would help make sure it gets done. I think at this point the tasks
> > on
> > > >> our
> > > >> > end are watching LEGAL-437 for advice (or making it moot by
> removing
> > > the
> > > >> > MySQL jar), asking Infra to set up automated builds once that is
> > > sorted
> > > >> > out, and building some kind of consensus around how we'll label
> and
> > > >> promote
> > > >> > the Docker images.
> > > >> >
> > > >> > On Fri, Feb 8, 2019 at 12:13 PM Don Bowman 
> > wrote:
> > > >> >
> > >> >> i'd be fine w/ removing the mysql, i'm using postgresql for the metadata.
> > >> >> if this is the case we should consider reflecting postgres as the default
> > >> >> metadata in the docs.
> > > >> >> however, i think this is mere aggregation under the gpl license,
> > and
> > > >> the
> > > >> >> docker image tend

Re: docker build

2019-03-05 Thread Charles Allen
Honestly we're at a very strange impasse. On one hand I don't think the ASF
project can adopt an official docker image unless ASF legal says its ok.
"Official" releases are source code anyways (as my understanding goes), and
binary artifacts are convenience things. Unfortunately I do not see a path
forward unless some entity is willing to take on a stance similar as
outlined in https://issues.apache.org/jira/browse/LEGAL-437 . This is
pretty new territory from a legal perspective (the fact that docker images
are layers makes it even more interesting).

At this point I think the safest thing to do is something that is "no more
GPL dependent than other containers in the apache repo", which would mean
not adding in GPL binaries. Which means switching to postgres. I don't
foresee an aggressive legal stance on this issue, meaning it might take a
while as people watch where the industry is going.



On Tue, Mar 5, 2019 at 8:20 AM Don Bowman  wrote:

> where do we stand on this?
> the PR is in and accepted, but i feel we need to have this built as part of
> the release artifacts and on dockerhub to foster adoption.
> if the only issue is the mysql connector i can remove it in favour of the
> postgres connector.
>
>
> On Mon, 18 Feb 2019 at 13:58, Don Bowman  wrote:
>
> > i can just remove the mysql, the postgres works, i was just assuming
> folks
> > wanted it.
> >
> >
> > On Mon, 18 Feb 2019 at 16:58, Gian Merlino  wrote:
> >
> >> A discussion is progressing on
> >> https://issues.apache.org/jira/browse/LEGAL-437. It doesn't seem to have
> >> got anywhere firm yet.
> >>
> >> On Fri, Feb 8, 2019 at 12:23 PM Gian Merlino  wrote:
> >>
> >> > I don't think anything is strictly needed from you at this point, but
> >> > things happen when people drive them, and participation in that effort
> >> > would help make sure it gets done. I think at this point the tasks on
> >> our
> >> > end are watching LEGAL-437 for advice (or making it moot by removing
> the
> >> > MySQL jar), asking Infra to set up automated builds once that is
> sorted
> >> > out, and building some kind of consensus around how we'll label and
> >> promote
> >> > the Docker images.
> >> >
> >> > On Fri, Feb 8, 2019 at 12:13 PM Don Bowman  wrote:
> >> >
> >> >> i'd be fine w/ removing the mysql, i'm using postgresql for the metadata.
> >> >> if this is the case we should consider reflecting postgres as the default
> >> >> metadata in the docs.
> >> >> however, i think this is mere aggregation under the gpl license, and
> >> the
> >> >> docker image tends to have other (e.g. bash) gpl code. druid's start
> >> >> scripts are all bash-specific as an example.
> >> >>
> >> >> I'm not clear if anything further is needed of me, i'm hoping to get
> an
> >> >> automated build going into dockerhub, and tagged w/ each release. i
> >> think
> >> >> this will help adoption.
> >> >>
> >> >>
> >> >>
> >> >> On Fri, 8 Feb 2019 at 14:22, Gian Merlino  wrote:
> >> >>
> >> >> > First off thanks a lot for your work here Don!!
> >> >> >
> >> >> > I really do think, though, that we need to be careful about the
> >> >> inclusion
> >> >> > of the MySQL connector jar. ASF legal has been clear in the past
> that
> >> >> ASF
> >> >> > projects should not distribute it as part of binary convenience
> >> >> releases:
> >> >> >
> >> >> > https://issues.apache.org/jira/browse/LEGAL-200. I think having the
> >> >> > Dockerfile in the repo is probably fine: in that case we are not
> >> >> > distributing the jar itself, just, essentially, a pointer to how to
> >> >> > download it. But if we start offering a prebuilt Docker image, it
> is
> >> >> less
> >> >> > clear to me if that is fine or not. In the interests of resolving
> >> this
> >> >> > question one way or the other, I opened a question asking about
> this
> >> >> > specific situation:
> >> >> > https://issues.apache.org/jira/browse/LEGAL-437.
> >> >> >
> >> >> > About Dylan's questions: my feeling is that we should go ahead and
> >> >> enable
> >> >> > automated pushes to Docker Hub, and provide some appropriate
> language
> >> >> > around what people should expect out of it. I don't think
> >> >> 'experimental' is
> >> >> > the right word, but we should be clear around exactly what contract
> >> we
> >> >> are
> >> >> > adhering to. Is it something people can expect to be published with
> >> each
> >> >> > 

Re: docker build

2019-03-05 Thread Charles Allen
haha, sorry, I mean I agree with Don

On Tue, Mar 5, 2019 at 9:37 AM Charles Allen  wrote:

> I agree with Gian on this sentiment.
>
> On Tue, Feb 19, 2019 at 7:47 AM Don Bowman  wrote:
>
>> On Mon, 18 Feb 2019 at 19:14, Gaurav Bhatnagar 
>> wrote:
>>
>> > I have been thinking if automated scripts can be provided to end users
>> in
>> > Druid for "Additional Dependencies" for user initiated installation and
>> > configuration of optional dependencies to avoid licensing issues. Later
>> > these scripts can be integrated in admin UI as configuration wizards.
>> >
>> >
>> >
>> Personally I think this is the opposite way the universe is going.
>> People want 'hermetic' images w/ read-only filesystems, named by a single
>> tag or SHA hash. This is what the container universe is about.
>> There's some work to do in druid (e.g. middlemanager logs) to improve this
>> (it currently logs into files in there rather than stdout by default, and
>> expects that elsewhere).
>>
>> w/ a product of the scale of druid, it's unlikely to be targeted @ 'small'
>> deployments.
>>
>


Re: docker build

2019-03-05 Thread Charles Allen
I agree with Gian on this sentiment.

On Tue, Feb 19, 2019 at 7:47 AM Don Bowman  wrote:

> On Mon, 18 Feb 2019 at 19:14, Gaurav Bhatnagar  wrote:
>
> > I have been thinking if automated scripts can be provided to end users in
> > Druid for "Additional Dependencies" for user initiated installation and
> > configuration of optional dependencies to avoid licensing issues. Later
> > these scripts can be integrated in admin UI as configuration wizards.
> >
> >
> >
> Personally I think this is the opposite way the universe is going.
> People want 'hermetic' images w/ read-only filesystems, named by a single
> tag or SHA hash. This is what the container universe is about.
> There's some work to do in druid (e.g. middlemanager logs) to improve this
> (it currently logs into files in there rather than stdout by default, and
> expects that elsewhere).
>
> w/ a product of the scale of druid, it's unlikely to be targeted @ 'small'
> deployments.
>


Re: Datasketches

2019-02-25 Thread Charles Allen
Basically there are a LOT of issues and PRs that show up when searching for
datasketches in the druid PR list:
https://github.com/apache/incubator-druid/pulls?utf8=%E2%9C%93&q=datasketches


Maybe just have a label called

Area - Sketches

?


On Mon, Feb 25, 2019 at 11:01 AM Gian Merlino  wrote:

> What scope would you suggest for the label or github project?
>
> There seem to be discussions going on around making DataSketches HLL and/or
> Quantiles more 'default' options for their respective areas -- are you
> thinking that kind of thing?
>
> On Mon, Feb 25, 2019 at 9:57 AM Charles Allen
>  wrote:
>
> > There are a lot of here and there discussions on how to handle sketching
> /
> > hll / histograms / other-stats, and it is getting kind of hard to keep
> > track of them all.
> >
> > In addition, looks like Datasketches is in an incubating proposal stage
> for
> > Apache
> >
> >
> > http://mail-archives.apache.org/mod_mbox/incubator-general/201902.mbox/%3CCA%2BUaPnt%3DUvbLr_v-4%2BYbAmHsAM-GqQG%2Bb%3DgOw3BL3Cemj%2BOwSA%40mail.gmail.com%3E
> >
> >
> > I think it is important enough and wide spread enough to have a top level
> > consideration within the druid project. Either a label or a "github
> > project" or something so that things can be tracked easier.
> >
> > Anyone have any opinions or desires here?
> >
> > Thanks,
> > Charles Allen
> >
>


Datasketches

2019-02-25 Thread Charles Allen
There are a lot of here and there discussions on how to handle sketching /
hll / histograms / other-stats, and it is getting kind of hard to keep
track of them all.

In addition, looks like Datasketches is in an incubating proposal stage for
Apache
http://mail-archives.apache.org/mod_mbox/incubator-general/201902.mbox/%3CCA%2BUaPnt%3DUvbLr_v-4%2BYbAmHsAM-GqQG%2Bb%3DgOw3BL3Cemj%2BOwSA%40mail.gmail.com%3E


I think it is important enough and wide spread enough to have a top level
consideration within the druid project. Either a label or a "github
project" or something so that things can be tracked easier.

Anyone have any opinions or desires here?

Thanks,
Charles Allen


Re: Knowledge sharing between Druid developers via technical talks

2019-02-22 Thread Charles Allen
Heads up, one blocker I ran into with recording is that there are a slew of
forms that need to be signed to be in compliance with Model Release and other
such things. Without a legal team making sure all the paperwork is in line
there might be some risk. Does anyone know if the ASF have any such
guidance on how to handle videos released on behalf of ASF projects?

Cheers,
Charles Allen

On Fri, Feb 22, 2019 at 11:20 AM Furkan KAMACI 
wrote:

> Hi,
>
> It would be really great to organize/attend such tech talks/meetups and
> record them.
>
> On the other hand, we can create a roadmap to touch common points to cover
> i.e.
>
> * How to install Druid for production
> * Tips & Tricks of schema design
> * Advanced queries by example
> * How to tune your Druid infrastructure
> * How to run Machine Learning & Fast Analytics with Druid & Spark
> * Step by step clickstream analysis with Druid
>
> This can be a bundle of learning videos followed by a path for anyone who
> wants to be a profound expert from superficial.
>
> Kind Regards,
> Furkan KAMACI
>
> On Fri, Feb 22, 2019 at 9:58 PM Julian Hyde  wrote:
>
> > I like that idea. I always wish that meetups were more about the
> community
> > of contributors (people writing code, answering questions, writing
> > documentation, and pushing the product to new places). But sadly meetups
> > are usually organized by marketing departments.
> >
> > Some conferences (e.g. Hadoop summit, and I suspect FOSDEM, OSCON and
> > Berlin Buzzwords) have BoFs (birds-of-a-feather meetings) that occur in
> the
> > evening after the main conference sessions. They are extremely free
> format,
> > and anyone who shows up can speak. If Druid contributors are heading to
> > such conferences, it’s worth sounding out on this list a few days before.
> > There might be other Druid contributors attending the same conference.
> >
> > Julian
> >
> >
> >
> > > On Feb 22, 2019, at 10:45 AM, Gian Merlino  wrote:
> > >
> > > Could be nice for the last talk in a meetup to be one of these, that
> way
> > > anyone that isn't interested could leave early.
> > >
> > > On Fri, Feb 22, 2019 at 9:51 AM Eyal Yurman <
> eyurma...@verizonmedia.com>
> > > wrote:
> > >
> > >> Thanks for the response, that sounds great!
> > >>
> > >> Since the meetups are user-focused, perhaps a separate "track" which
> is
> > >> open to all but is dev-focused? This could be before/after the main
> > event.
> > >>
> > >> I promise that once I get enough experience with the code base, I'd
> > >> volunteer to present, but hopefully, there are much better candidates
> at
> > >> the moment :)
> > >>
> > >> On Mon, Feb 18, 2019 at 1:36 PM Gian Merlino 
> > >> wrote:
> > >>
> > >>> I am interested especially if the format is something live. An
> > in-person
> > >>> meetup with a recording distributed afterwards would be my
> preference,
> > if
> > >>> people are into that. Maybe something at one of the Druid meetups?
> > >>>
> > >>> On Wed, Feb 13, 2019 at 8:38 PM Eyal Yurman
> > >>>  wrote:
> > >>>
> > >>>> Hi,
> > >>>>
> > >>>> This is something usually being done in companies, but I think it is
> > >>>> useful
> > >>>> for any community, especially our community which is so distributed.
> > >>>>
> > >>>> I think it would be absolutely wonderful if we can find people
> > willing to
> > >>>> share their knowledge with other contributors via the form of a
> > >>>> tech-talk.
> > >>>> I.e. it would be very useful if someone could take a subject (Just
> for
> > >>>> example, groupBy query) and present the high-level
> > >>>> architecture/implementation.
> > >>>>
> > >>>> I know this requires significant effort, but I hope to convince you
> of
> > >>>> the
> > >>>> benefits it would provide to the Druid project:
> > >>>> - Helping any newcomer being more effective, thus providing better
> > >>>> contribution ROI against work effort.
> > >>>> - Serving as a high-quality medium of communication within the group
> > of
> > >>>> committers, which would lead to more trust and understanding.
> > >>>>
> > >>>> Recording and uploaded such sessions will make them Apache-Way
> > compatible
> > >>>> (Along with serving future viewers).
> > >>>>
> > >>>> So, anyone up to the challenge? :)
> > >>>>
> > >>>> Eyal.
> > >>>>
> > >>>
> >
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> > For additional commands, e-mail: dev-h...@druid.apache.org
> >
> >
>


Dev sync

2019-02-12 Thread Charles Allen
I am unable to host the dev sync this week.

Is anyone finding utility out of these? The dev list seems pretty active
these days, so the legacy utility of the dev sync is very muted (this is a
good thing). Unless people are finding specific utility out of a weekly
video sync up, I propose it be postponed indefinitely until a need can be
identified.

Thoughts?


Re: json logging enable

2019-02-08 Thread Charles Allen
Actually druid uses SLF4J, so you should be able to add in the logback libs
to the classpath and things *should* just work.
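
As a sketch of what that swap could look like, here is a minimal logback.xml
pairing logback-classic with the logstash-logback-encoder Don linked. This is
untested against Druid's log4j shutdown hooks; the encoder class name comes
from that project, and the appender name is arbitrary:

```xml
<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <!-- Emits one JSON object per log event, including full stack traces -->
    <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
  </appender>
  <root level="INFO">
    <appender-ref ref="STDOUT"/>
  </root>
</configuration>
```

With logback-classic on the classpath (and the log4j-to-slf4j binding removed),
SLF4J picks this up without code changes.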

On Fri, Feb 8, 2019 at 10:09 AM Don Bowman  wrote:

> thanks!
>
> is there any movement to move to 'logback' from log4j? That would allow
> using e.g. https://github.com/logstash/logstash-logback-encoder
>
> the problem w/ the JsonLayout approach is that it doesn't seem to get
> exceptions properly.
>
> E.g. I get something like below.
>
>
> {"timeMillis":1549648630326,"thread":"main","level":"INFO","loggerName":"org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler","message":"Invoking
> stop method[public void
>
> org.apache.druid.initialization.Log4jShutterDownerModule$Log4jShutterDowner.stop()]
> on
>
> object[org.apache.druid.initialization.Log4jShutterDownerModule$Log4jShutterDowner@61d60e38
>
> ].","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":[]}
> 2019-02-08 17:57:10,342 main ERROR Unable to register shutdown hook because
> JVM is shutting down. java.lang.IllegalStateException: Not started
> at
>
> org.apache.druid.common.config.Log4jShutdown.addShutdownCallback(Log4jShutdown.java:48)
> at
>
> org.apache.logging.log4j.core.impl.Log4jContextFactory.addShutdownCallback(Log4jContextFactory.java:273)
> at
>
> org.apache.logging.log4j.core.LoggerContext.setUpShutdownHook(LoggerContext.java:256)
> at
> org.apache.logging.log4j.core.LoggerContext.start(LoggerContext.java:216)
> at
>
> org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:145)
> at
>
> org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:41)
> at org.apache.logging.log4j.LogManager.getContext(LogManager.java:182)
> at
>
> org.apache.logging.log4j.spi.AbstractLoggerAdapter.getContext(AbstractLoggerAdapter.java:103)
> at
>
> org.apache.logging.slf4j.Log4jLoggerFactory.getContext(Log4jLoggerFactory.java:43)
> at
>
> org.apache.logging.log4j.spi.AbstractLoggerAdapter.getLogger(AbstractLoggerAdapter.java:42)
> at
>
> org.apache.logging.slf4j.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:29)
> at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:253)
> at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:265)
> at
>
> org.apache.curator.utils.CloseableExecutorService.(CloseableExecutorService.java:40)
> at
>
> org.apache.druid.curator.cache.PathChildrenCacheFactory.make(PathChildrenCacheFactory.java:55)
> at
>
> org.apache.druid.curator.inventory.CuratorInventoryManager.start(CuratorInventoryManager.java:109)
> at
>
> org.apache.druid.client.AbstractCuratorServerInventoryView.start(AbstractCuratorServerInventoryView.java:168)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
>
> org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler.start(Lifecycle.java:427)
> at
>
> org.apache.druid.java.util.common.lifecycle.Lifecycle.start(Lifecycle.java:323)
> at org.apache.druid.guice.LifecycleModule$2.start(LifecycleModule.java:138)
> at org.apache.druid.cli.GuiceRunnable.initLifecycle(GuiceRunnable.java:107)
> at org.apache.druid.cli.ServerRunnable.run(ServerRunnable.java:58)
> at org.apache.druid.cli.Main.main(Main.java:118)
>
>
>
> On Fri, 8 Feb 2019 at 12:27, Charles Allen  .invalid>
> wrote:
>
> > Structured logging is not in a very good state in Druid right now (or the
> > industry in general). Part of the issue is that the log4j standard json
> > format is not very compatible with modern json logging systems. I even
> > modified the sumologic appender to get better productionized items into
> it
> > at https://github.com/metamx/sumologic-log4j2-appender . I do not know
> if
> > there is a stackdriver friendly log4j formatter out there, but it would
> not
> > surprise me if there was but it pulled in guava version future+infinity.
> >
> > My current log4j2 layout
> > <https://logging.apache.org/log4j/2.x/manual/layouts.html#JSONLayout> line
> > looks like this:
> >  > "true" />
> >

Re: Forbiddenapis Plugin

2019-01-31 Thread Charles Allen
Is this indicative of latent bugs the generated sources have?

On Thu, Jan 31, 2019 at 8:55 AM Gian Merlino  wrote:

> I get those sometimes with generated sources -- typically doing a "mvn
> clean" beforehand clears it up. We might be able to add exclusions for the
> generated source directories in order to avoid the need to do this.
>
> On Thu, Jan 31, 2019 at 5:15 AM Furkan KAMACI 
> wrote:
>
> > I try to run forbiddenapis plugin at Druid. However I get that errors but
> > does not know where they actually points:
> >
> > [INFO] Scanning classes for violations...
> > [ERROR] Forbidden method invocation:
> > java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses
> default
> > locale]
> > [ERROR]   in org.apache.druid.math.expr.BinaryEvalOpExprBase (Expr.java,
> > method body of '$$$reportNull$$$0(int)')
> > [ERROR] Forbidden method invocation:
> > java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses
> default
> > locale]
> > [ERROR]   in org.apache.druid.math.expr.LongExpr (Expr.java, method body
> of
> > '$$$reportNull$$$0(int)')
> > [ERROR] Forbidden method invocation:
> > java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses
> default
> > locale]
> > [ERROR]   in org.apache.druid.math.expr.FunctionExpr (Expr.java, method
> > body of '$$$reportNull$$$0(int)')
> > [ERROR] Forbidden method invocation:
> > java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses
> default
> > locale]
> > [ERROR]   in org.apache.druid.data.input.impl.InputRowParser
> > (InputRowParser.java, method body of '$$$reportNull$$$0(int)')
> > [ERROR] Forbidden method invocation:
> > java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses
> default
> > locale]
> > [ERROR]   in org.apache.druid.math.expr.BinAndExpr (Expr.java, method
> body
> > of '$$$reportNull$$$0(int)')
> > [ERROR] Forbidden method invocation:
> > java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses
> default
> > locale]
> > [ERROR]   in org.apache.druid.java.util.common.concurrent.Execs
> > (Execs.java, method body of '$$$reportNull$$$0(int)')
> > [ERROR] Forbidden method invocation:
> > java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses
> default
> > locale]
> > [ERROR]   in org.apache.druid.math.expr.BinOrExpr (Expr.java, method body
> > of '$$$reportNull$$$0(int)')
> > [ERROR] Forbidden method invocation:
> > java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses
> default
> > locale]
> > [ERROR]   in org.apache.druid.math.expr.StringExpr (Expr.java, method
> body
> > of '$$$reportNull$$$0(int)')
> > [ERROR] Forbidden method invocation:
> > java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses
> default
> > locale]
> > [ERROR]   in org.apache.druid.math.expr.DoubleExpr (Expr.java, method
> body
> > of '$$$reportNull$$$0(int)')
> > [ERROR] Forbidden method invocation:
> > java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses
> default
> > locale]
> > [ERROR]   in org.apache.druid.math.expr.UnaryMinusExpr (Expr.java, method
> > body of '$$$reportNull$$$0(int)')
> > [ERROR] Forbidden method invocation:
> > java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses
> default
> > locale]
> > [ERROR]   in org.apache.druid.math.expr.UnaryNotExpr (Expr.java, method
> > body of '$$$reportNull$$$0(int)')
> > [ERROR] Forbidden method invocation:
> > java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses
> default
> > locale]
> > [ERROR]   in org.apache.druid.math.expr.IdentifierExpr (Expr.java, method
> > body of '$$$reportNull$$$0(int)')
> > [ERROR] Scanned 714 class file(s) for forbidden API invocations (in
> 0.65s),
> > 12 error(s).
> >
> > Do you have any idea?
> >
> > Kind Regards,
> > Furkan KAMACI
> >
>
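
The "[Uses default locale]" violations above all point at
String#format(String, Object...), whose output changes with the JVM's default
locale. A small stdlib-only sketch of the hazard and the usual fix (pass a
Locale explicitly; the class and method names below are illustrative, not
Druid code):

```java
import java.util.Locale;

public class LocaleFormatDemo {
    // Flagged by forbidden-apis: the result depends on the JVM default locale.
    static String defaultLocaleFormat(double d) {
        return String.format("%.2f", d);
    }

    // The usual fix: pin the locale so the output is stable across JVMs.
    static String pinnedFormat(double d) {
        return String.format(Locale.ROOT, "%.2f", d);
    }

    public static void main(String[] args) {
        Locale.setDefault(Locale.GERMANY); // simulate a German-locale JVM
        System.out.println(defaultLocaleFormat(3.14)); // "3,14" -- comma separator
        System.out.println(pinnedFormat(3.14));        // "3.14" everywhere
    }
}
```

Druid's own StringUtils.format wrapper exists for the same reason: it formats
with a fixed locale instead of the platform default.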


Re: Slow download of segments from deep storage

2019-01-30 Thread Charles Allen
I *think* the HTTP coordination already enables this

On Wed, Jan 30, 2019 at 4:20 PM Samarth Jain  wrote:

> We noticed that it takes a long time for the historicals to download
> segments from deep storage (in our case S3). Looking closer at the code in
> ZKCoordinator, I noticed that the segment download is happening in a single
> threaded fashion. This download happens in the SingleThreadedExecutor
> service used by the PathChildrenCache. Looking at the commentary on
> https://github.com/apache/incubator-druid/issues/4421 and
> https://github.com/apache/incubator-druid/issues/3202, the executor
> service
> used in PathChildrenCache can only be single threaded.
>
> My proposal is to use a multi threaded ExecutorService that will be used to
> take action on the events to perform the download. The role of the single
> threaded ExecutorService in PathChildrenCache will be simply to delegate
> the download task to this new executor service.
>
> Does that sound feasible? IMO, if this happens to be functionally correct,
> it should help significantly boost up the time it is taking historicals to
> download all the assigned segments.
>
> I would be more than happy to contribute this enhancement to the community.
>
> Thanks,
> Samarth
>
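
The proposed delegation — the single watcher thread immediately handing each
download to a pool — can be sketched with plain java.util.concurrent
primitives (all names here are illustrative; the real PathChildrenCache
callback API differs):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class DelegatingSegmentLoader {
    // Stand-in for PathChildrenCache's mandatory single-threaded executor.
    private final ExecutorService watcher = Executors.newSingleThreadExecutor();
    // Pool that performs the actual (slow) downloads concurrently.
    private final ExecutorService downloadPool = Executors.newFixedThreadPool(4);

    // Hypothetical child-event callback: instead of downloading inline on the
    // watcher thread, hand the work off and return to watching immediately.
    public CompletableFuture<String> onSegmentAssigned(String segmentId) {
        CompletableFuture<String> loaded = new CompletableFuture<>();
        watcher.submit(() -> downloadPool.submit(() -> {
            // a real implementation would pull the segment from deep storage here
            loaded.complete("loaded:" + segmentId);
        }));
        return loaded;
    }

    public void shutdown() {
        watcher.shutdown();
        downloadPool.shutdown();
    }
}
```

The watcher thread only enqueues work, so event ordering from ZooKeeper is
preserved while up to four downloads proceed in parallel.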


Re: Dev sync this week

2019-01-29 Thread Charles Allen
Sorry, yes, that is the information I meant: the Feature Freeze is in its
final stages.

On Tue, Jan 29, 2019 at 12:21 PM David Glasser 
wrote:

> On Tue, Jan 29, 2019 at 10:45 AM Charles Allen
>  wrote:
> > * *0.14* release is in final stages, any blockers for 0.14 should be
> called out to the dev list.
>
> Can you clarify this? I don't think a 0.14 branch has been cut yet —
> things that land on master soon are likely to make it in, right?
>
> --dave (hoping a tiny approved PR of his makes it :) )
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> For additional commands, e-mail: dev-h...@druid.apache.org
>
>


Dev sync today

2019-01-22 Thread Charles Allen
My computer is having issues. Can anyone else start the dev sync today?


Re: Docker image

2019-01-16 Thread Charles Allen
The idea has been toyed around with internally here. What would your
expectations be of such an image?


On Wed, Jan 16, 2019 at 2:35 PM Don Bowman  wrote:

> Is anyone working on a docker image? I mean, there are quite a few out
> there but they have some various issues, usually security based as they
> inherit from non-too-strong bases.
>
> I have done one w/ gcr.io/distroless/java as the parent, and it seems
> working, but not sure if there is a reason or strategy for not having one
> in the repo and built by travis to dockerhub.
>
> Some of us would like to be deploying via helm in kubernetes and this is
> causing it to be a bit complex.
>


Sync up this week

2019-01-15 Thread Charles Allen
To join the video meeting, click this link:
https://meet.google.com/ozi-rtfg-ags
Otherwise, to join by phone, dial +1 442-666-1256 and enter this PIN: 6867#
To view more phone numbers, click this link:
https://tel.meet/ozi-rtfg-ags?hs=5

Cheers!


Re: dev sync this week

2019-01-08 Thread Charles Allen
not much interest this week, ducking out

On Tue, Jan 8, 2019 at 10:06 AM Charles Allen 
wrote:

> https://meet.google.com/ozi-rtfg-ags
>
> sorry for the late notice
>


Re: Watermarks!

2019-01-07 Thread Charles Allen
I'll answer the last question first:

Many data groups are processed via Airflow, so having a batch component
compatible with Airflow is more impactful than being able to live stream
data as it stands right now. I'm constantly on the lookout for a use case
where druid streaming is a good fit for a solution (as opposed to
Graphite/grafana, or even potentially prometheus) but haven't found one yet
where the overhead for maintaining the extra realtime and streaming system
is worth the payout. From a technology investment point of view, a Beam
compatible sink (which we have an internal one based on tranquility for
streaming sinks) might end up working. I am interested to see if the KIS
features can be leveraged to work with systems outside of kafka. Also of
great interest is to see if the "resources per task" can be made more
tunable instead of being a single cookie cutter footprint. The need for
huge resources during the final merge-and-push phase compared to the
incremental intake phase is also a major pain point and cause of
inefficiency for Druid streaming stuff.

Watermarking *could* tell if segments are unavailable (i.e. a whole hour of
data is missing) and fail the query accordingly if the watermark cursor was
not advanced beyond the interval end. I have not attempted to put such an
interrupt into the query layer though. It is a very intriguing idea. In
general the cursors work by monitoring the segment availability
announcements and watches for certain criteria to be met before advancing.
A very simple example here would be to halt a watermark's progression until
at least *some* data for a time range is available in some segment
somewhere. A more advanced cursor would have a concept of "completeness"
and only advance the watermark once some time range has reached some
"complete" criteria (number of events, or signal from external system could
make sense).
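
That cursor behavior can be sketched roughly like this (all names are
hypothetical; a real implementation would plug the completeness criteria in
from segment-availability announcements or an external signal, as described
above):

```java
public class WatermarkCursor {
    // Hypothetical completeness check for one time interval.
    public interface Completeness {
        boolean isComplete(long intervalStartMillis);
    }

    private long watermark;             // start of the first non-complete interval
    private final long intervalMillis;  // e.g. one hour
    private final Completeness check;

    public WatermarkCursor(long startMillis, long intervalMillis, Completeness check) {
        this.watermark = startMillis;
        this.intervalMillis = intervalMillis;
        this.check = check;
    }

    // Advance across consecutive complete intervals and halt at the first
    // incomplete one -- the watermark never skips over a hole in the data.
    public long advance() {
        while (check.isComplete(watermark)) {
            watermark += intervalMillis;
        }
        return watermark;
    }
}
```

A query or downstream batch job would then only trust intervals strictly
before the returned watermark.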

The nice thing here is also with automated checks, which can wait until the
watermark has progressed before querying the druid cluster for some data.

Hopefully that answers some questions,
Charles Allen


On Mon, Jan 7, 2019 at 12:50 PM Gian Merlino  wrote:

> For Kafka, maybe something that tells you if all committed data is actually
> loaded, & what offset has been committed up to? Would there by any problems
> caused by the fact that only the most recent commit is saved in the DB?
>
> Is this feature connected at all to an ask I have heard from a few people:
> that there be an option to fail a query (or at least include a special
> response header) if some segments in the interval are unavailable? (Which,
> currently, the broker can't know since it doesn't know details about all
> available segments.)
>
> Btw, at your site do you have any plans to migrate to Kafka indexing?
>
> On Wed, Jan 2, 2019 at 5:37 PM Charles Allen  .invalid>
> wrote:
>
> > Hi all!
> >
> > https://github.com/apache/incubator-druid/pull/6799
> >
> > A contribution is up that includes a neat feature we have been using
> > internally called Watermarks. Basically when operating a large scale and
> > multi-tenant system, it is handy to be able to monitor how 'well behaved'
> > the data is with regard to history. This is commonly used to spot holes
> in
> > data, and to help give hints to data consumers in a lambda environment on
> > when data has been run through a thorough check (batch job) vs a best
> > effort sketch of the results which may or may not handle late data well
> > (streaming intake).
> >
> > Unfortunately i'm not really sure what meta-data would be handy to have
> for
> > the kafka indexing service, so I'd love input there as well if anyone
> knows
> > of any "watermarks" that would make sense for it.
> >
> > Since the extension was written to be a stand alone service, it can
> remain
> > as an extension forever if desired. An alternative I would like to
> propose
> > is that the primitives for the watermark feature be added to core druid,
> > and the extension points be added to their respective places (mysql
> > extension and google extension to name two explicitly).
> >
> > Let me know what you think!
> > Charles Allen
> >
>


Watermarks!

2019-01-02 Thread Charles Allen
Hi all!

https://github.com/apache/incubator-druid/pull/6799

A contribution is up that includes a neat feature we have been using
internally called Watermarks. Basically when operating a large scale and
multi-tenant system, it is handy to be able to monitor how 'well behaved'
the data is with regard to history. This is commonly used to spot holes in
data, and to help give hints to data consumers in a lambda environment on
when data has been run through a thorough check (batch job) vs a best
effort sketch of the results which may or may not handle late data well
(streaming intake).

Unfortunately i'm not really sure what meta-data would be handy to have for
the kafka indexing service, so I'd love input there as well if anyone knows
of any "watermarks" that would make sense for it.

Since the extension was written to be a stand alone service, it can remain
as an extension forever if desired. An alternative I would like to propose
is that the primitives for the watermark feature be added to core druid,
and the extension points be added to their respective places (mysql
extension and google extension to name two explicitly).

Let me know what you think!
Charles Allen


Re: Writing a Druid extension

2019-01-02 Thread Charles Allen
https://github.com/apache/incubator-druid/pull/6798 Please check it out
Nikita

On Sat, Dec 29, 2018 at 11:59 PM Nikita Dolgov 
wrote:

> I was experimenting with a Druid extension prototype and encountered some
> difficulties. The experiment is to build something like
> https://github.com/apache/incubator-druid/issues/3891 with gRPC.
>
> (1) Guava version
>
> Druid relies on 16.0.1 which is a very old version (~4 years). My only
> guess is another transitive dependency (Hadoop?) requires it. The earliest
> version used by gRPC from three years ago was 19.0. So my first question is
> if there are any plans for upgrading Guava any time soon.
>
> (2) Druid thread model for query execution
>
> I played a little with calling
> org.apache.druid.server.QueryLifecycleFactory::runSimple under debugger.
> The stack trace was rather deep to reverse engineer easily so I'd like to
> ask directly instead. Would it be possible to briefly explain how many
> threads (and from which thread pool) it takes on a broker node to process,
> say, a GroupBy query.
>
> At the very least I'd like to know if calling
> QueryLifecycleFactory::runSimple on a thread from some "query processing
> pool" is better than doing it on the IO thread that received the query.
>
> (3) Yielder
>
> Is it safe to assume that QueryLifecycleFactory::runSimple always returns
> a Yielder ? QueryLifecycle omits generic
> types rather liberally when dealing with Sequence instances.
>
> (4) Calcite integration
>
> Presumably Avatica has an option of using protobuf encoding for the
> returned results. Is it true that Druid cannot use it?
> On a related note, any chance there was something written down about
> org.apache.druid.sql.calcite ?
>
> Thank you
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> For additional commands, e-mail: dev-h...@druid.apache.org
>
>


Re: Writing a Druid extension

2019-01-02 Thread Charles Allen
We have a functional gRPC extension for brokers internally. Let me see if I
can get approval for releasing it.

For the explicit answers:

1) Guava 16

Yep, druid is stuck on it due to hadoop.
https://github.com/apache/incubator-druid/pull/5413 is the only outstanding
issue I know of that would allow a very wide swath of guava versions to be
used. Once a solution for the same-thread executor service gets into place,
then you should be able to modify your local deployment to whatever guava
version fits with your indexing config.

2) Group By thread processing

You picked the hardest one here :) there is all kinds of multi-threaded fun
that can show up when dealing with group by queries. If you want a good
dive into this I suggest checking out
https://github.com/apache/incubator-druid/pull/6629 which will put you
straight into the weeds of it all.

3) Yielder / Sequence type safety

Yeah... I don't have any good info there other than "things aren't
currently broken". There are some really nasty and hacky type casts related
to by-segment sequences if you start digging around the code.

4) Calcite Proto

This is a great question. I imagine getting a Calcite Proto SQL endpoint
setup in an extension wouldn't be too hard, but have not tried such a
thing. This one would probably be worth having its own discussion thread
(maybe an issue?) on how to handle.

You are on the right track!
Charles Allen

On Sat, Dec 29, 2018 at 11:59 PM Nikita Dolgov 
wrote:

> I was experimenting with a Druid extension prototype and encountered some
> difficulties. The experiment is to build something like
> https://github.com/apache/incubator-druid/issues/3891 with gRPC.
>
> (1) Guava version
>
> Druid relies on 16.0.1 which is a very old version (~4 years). My only
> guess is another transitive dependency (Hadoop?) requires it. The earliest
> version used by gRPC from three years ago was 19.0. So my first question is
> if there are any plans for upgrading Guava any time soon.
>
> (2) Druid thread model for query execution
>
> I played a little with calling
> org.apache.druid.server.QueryLifecycleFactory::runSimple under debugger.
> The stack trace was rather deep to reverse engineer easily so I'd like to
> ask directly instead. Would it be possible to briefly explain how many
> threads (and from which thread pool) it takes on a broker node to process,
> say, a GroupBy query.
>
> At the very least I'd like to know if calling
> QueryLifecycleFactory::runSimple on a thread from some "query processing
> pool" is better than doing it on the IO thread that received the query.
>
> (3) Yielder
>
> Is it safe to assume that QueryLifecycleFactory::runSimple always returns
> a Yielder ? QueryLifecycle omits generic
> types rather liberally when dealing with Sequence instances.
>
> (4) Calcite integration
>
> Presumably Avatica has an option of using protobuf encoding for the
> returned results. Is it true that Druid cannot use it?
> On a related note, any chance there was something written down about
> org.apache.druid.sql.calcite ?
>
> Thank you
>
>
>
>


Re: Drop 0. from the version

2018-12-21 Thread Charles Allen
If I'm greedily honest, I don't want to maintain that many backport
channels. I'd rather have "If you want XYZ backport for version 14, then
you have to take the latest minor version for 14" and have a policy where
someone can upgrade from 14.x --> 14.latest with (hopefully) no
config changes.




On Fri, Dec 21, 2018 at 9:03 AM David Glasser 
wrote:

> One nice advantage to moving out of 0.x is that it frees up a digit on the
> right side to more cleanly differentiate between "minor release (a random
> assortment of bug fixes, small features, etc)" and "patch release
> (literally the minimum delta to give you a security fix or high impact bug
> fix)".
>
> --dave
>
> On Fri, Dec 21, 2018 at 8:58 AM Gian Merlino  wrote:
>
> > I'm not too fussy around whether we do a 1.0 or simply drop the 0. and
> have
> > it be a 14.0 or 15.0 or 16.0 or wherever we are at the time we do it. I
> > also like the quarterly cadence of release-from-master we had before we
> got
> > blocked on the ASF transition, and would like to pick that back up again
> > (with the next branch cut from master at the end of January, since we did
> > the 0.13.0 branch cut in late October).
> >
> > Seems to me that good points of discussion are, what should we use as the
> > rule for incrementing the major version? Do we do what we've been doing
> > (incrementing whenever there's either an incompatible change in extension
> > APIs, or in query APIs, or when necessary to preserve the ability to
> always
> > be able to roll forward/back one major release). Or do we do something
> else
> > (Roman seems to be suggesting dropping extension APIs from
> consideration).
> >
> > And also, what does 1.0 or 14.0 or 15.0 or what-have-you mean to us? Is
> it
> > something that should be tied to ASF graduation? Completeness of vision?
> > Stability of APIs or operational characteristics? Something else? You are
> > right that it is sort of a marketing/mentality thing, so it's an
> > opportunity for us to declare that we feel Druid has reached some
> > milestone. My feeling at this time is probably ASF graduation or
> > completeness of vision (see my earlier mail for thoughts there) are the
> > ones that make most sense to me.
> >
> > On Fri, Dec 21, 2018 at 10:41 AM Charles Allen 
> wrote:
> >
> > > Is there any feeling in the community that the logic behind the
> releases
> > > needs to change?
> > >
> > > If so then I think we should discuss what that release cadence needs to
> > > look like.
> > >
> > > If not then dropping the 0. prefix is a marketing / mental item. Kind
> of
> > > like the 3.x->4.x Linux kernel upgrade. If this is the case then would
> we
> > > even want to go with 1.x? I think Roman's proposal would work fine in
> > this
> > > case. Where we just call it Apache Druid 14 (or 15 or whatever it is
> when
> > > we get there) and just keep the same logic for when we release stuff,
> > which
> > > has been something like:
> > >
> > > For a X.Y release, going to a X.? release should be very straight
> forward
> > > for anyone running stock Druid.
> > > For a X.Y release, going to a (X+1).? or from a (X+1).? back to an X.Y
> > > release should be feasible. It might require running a tool supported
> by
> > > the community.
> > > For a X.Y release, going to an (X+2).? or an (X-2).? is not supported.
> > Some
> > > things that will not have tools might have warning logs printed that
> the
> > > functionality will change (should we change these to alerts?)
> > >
> > > If this sounds reasonable then jumping straight to Apache Druid 14 on
> the
> > > first official apache release would make a lot of sense.
> > >
> > > Cheers,
> > > Charles Allen
> > >
> > >
> > > On Thu, Dec 20, 2018 at 11:07 PM Gian Merlino  wrote:
> > >
> > > > I think it's a good point. Culturally we have been willing to break
> > > > extension APIs for relatively small benefits. But we have generally
> > been
> > > > unwilling to make breaking changes on the operations side quite so
> > > > liberally. Also, most cluster operators don't have their own custom
> > > > extensions, in my experience. So it does make sense to differentiate
> > > them.
> > > > I'm not sure how it makes sense to differentiate them, though. It
> could
> > > be
> > > > done through the version number (only increment the major v

Re: Drop 0. from the version

2018-12-21 Thread Charles Allen
Is there any feeling in the community that the logic behind the releases
needs to change?

If so then I think we should discuss what that release cadence needs to
look like.

If not then dropping the 0. prefix is a marketing / mental item. Kind of
like the 3.x->4.x Linux kernel upgrade. If this is the case then would we
even want to go with 1.x? I think Roman's proposal would work fine in this
case. Where we just call it Apache Druid 14 (or 15 or whatever it is when
we get there) and just keep the same logic for when we release stuff, which
has been something like:

For a X.Y release, going to a X.? release should be very straightforward
for anyone running stock Druid.
For a X.Y release, going to a (X+1).? or from a (X+1).? back to an X.Y
release should be feasible. It might require running a tool supported by
the community.
For a X.Y release, going to an (X+2).? or an (X-2).? is not supported. Some
things that will not have tools might have warning logs printed that the
functionality will change (should we change these to alerts?)

If this sounds reasonable then jumping straight to Apache Druid 14 on the
first official apache release would make a lot of sense.

Cheers,
Charles Allen


On Thu, Dec 20, 2018 at 11:07 PM Gian Merlino  wrote:

> I think it's a good point. Culturally we have been willing to break
> extension APIs for relatively small benefits. But we have generally been
> unwilling to make breaking changes on the operations side quite so
> liberally. Also, most cluster operators don't have their own custom
> extensions, in my experience. So it does make sense to differentiate them.
> I'm not sure how it makes sense to differentiate them, though. It could be
> done through the version number (only increment the major version for
> operations breaking changes) or it could be done through an "upgrading"
> guide in the documentation (increment the major version for operations or
> extension breaking changes, but, have a guide that tells people which
> versions have operations breaking changes to aid in upgrades).
>
> Coming back to the question in the subject of your mail: IMO, for
> "graduation" out of 0.x, we should talk as a community about what that
> means to us. It is a milestone that on the one hand, doesn't mean much, but
> on the other hand, can be deeply symbolic. Some things that it has meant to
> other projects:
>
> 1) Production readiness. Obviously Druid is well past this. If this is what
> dropping the 0. means, then we should do it immediately.
>
> 2) Belief that the APIs have become relatively stable. Like you said, the
> extension APIs don't seem particularly close to stable, but maybe that's
> okay. However, the pace of breaking changes on the operations and query
> side for non-experimental features has been relatively calm for the past
> couple of years, so if we focus on that then we can make a case here.
>
> 3) Completeness of vision. This one is the most interesting to me. I
> suspect that different people in the community have different visions for
> Druid. It is also the kind of project that may never truly be complete in
> vision (in principle, the platform could become a competitive data
> warehouse, search engine, etc, …). For what it's worth, my vision of Druid
> for the next year at least involves robust stream ingestion being a first
> class ingestion method (Kafka / Kinesis indexing service style) and SQL
> being a first class query language. These are both, today, still
> experimental features. So are lookups. All of these 3 features, from what I
> can see, are quite popular amongst Druid users despite being experimental.
> For a 'completeness of vision' based 1.0 I would want to lift all of those
> out of experimental status and, for SQL in particular, to have its
> functionality rounded out a bit more (to support the native query features
> it doesn't currently support, like multi-value dimensions, datasketches,
> etc).
>
> 4) Marketing / timing. Like, doing a 1.0 around the time we graduate from
> the Incubator. Not sure how much this really matters, but projects do it
> sometimes.
>
> Another question is, how often do we intend to rev the version? At the rate
> we're going, we rev 2-3 major versions a year. Would we intend to keep that
> up, or slow it down by making more of an effort to avoid breaking changes?
>
> On Thu, Dec 20, 2018 at 2:17 PM Roman Leventov 
> wrote:
>
> > It may also make sense to distinguish "operations" breaking changes from
> > API breaking changes. Operations breaking changes establish the minimum
> > cadence of Druid cluster upgrades, that allow rolling Druid versions back
> > and forward. I. e. it's related to segment format, the format of the data
> > kept in ZooKeeper and the SQL database, or events such as

Unable to start dev sync this week

2018-12-18 Thread Charles Allen
I'm unable to start the dev sync this week. Can someone else start it up?

Thanks,
Charles Allen


Re: This week's dev sync

2018-12-11 Thread Charles Allen
Notes:
* Charles is official note taker for this session.
* Release is looking good so far. David working on getting total release
cut.
* Unsure what the status of the website build for release is. If there are
blockers it is asked to be called out in the dev list.
* Charles mentioned https://github.com/apache/incubator-druid/pull/5913 and
its failure in group by queries as a main blocker for adoption. Jihoon has
https://github.com/apache/incubator-druid/pull/6629 as an alternative
approach to a very similar problem. The authors in question have been in
sync, aware that parallel development was a risk, so this is not a surprise to
either of us.
* Clint mentioned lots of pretty big PRs outstanding.
* There is a growing interest in various groups about result aggregations
like moving averages and cumulative totals. If there are pockets of effort
in similar post-query processing or result level processing, please make
sure it is known in the community.
* It is proposed the dev sync for Dec 25th and Jan 1st be skipped.


On Tue, Dec 11, 2018 at 9:51 AM Charles Allen 
wrote:

> To join the video meeting, click this link:
> https://meet.google.com/ozi-rtfg-ags
> Otherwise, to join by phone, dial +1 442-666-1256 and enter this PIN: 6867#
> To view more phone numbers, click this link:
> https://tel.meet/ozi-rtfg-ags?hs=5
>


This week's dev sync

2018-12-11 Thread Charles Allen
To join the video meeting, click this link:
https://meet.google.com/ozi-rtfg-ags
Otherwise, to join by phone, dial +1 442-666-1256 and enter this PIN: 6867#
To view more phone numbers, click this link:
https://tel.meet/ozi-rtfg-ags?hs=5


Dev sync this week

2018-11-27 Thread Charles Allen
I have a conflict for the meeting this week (again, unfortunately)

Is anyone else able to start the meeting?


Sync up this week

2018-11-13 Thread Charles Allen
Hi all!

I have an off-site today so will not be able to host the sync up. Is anyone
else able to host?

Thank you,
Charles Allen


Dev sync this week

2018-11-06 Thread Charles Allen
I have a conflict. Can someone else start the dev sync this week?

Thank you,
Charles Allen


Bug in Jackson Smile 2.6.5 (-Pspark2)

2018-10-30 Thread Charles Allen
If anyone is building druid with -Pspark2, there is a weird bug in older
versions of Smile documented in
https://github.com/apache/incubator-druid/issues/6553


This week's dev sync

2018-10-30 Thread Charles Allen
https://meet.google.com/ozi-rtfg-ags

Cheers!
Charles Allen


Dev sync this week

2018-10-23 Thread Charles Allen
I have a conflict this week. Is anyone else able to start the sync up?


Re: Druid + Theta Sketches performance

2018-10-22 Thread Charles Allen
Honestly I do not remember how the dimension exclusion vs dimension
inclusion stuff works. I have to look it up every time. If you look at any
segment for that datasource in the Coordinator Console, it should give you
a list of dimensions and metrics. Do they match what you expect?

On Sun, Oct 21, 2018 at 9:17 PM alex.rnv...@gmail.com 
wrote:

>
>
> On 2018/10/19 14:42:18, Charles Allen 
> wrote:
> > This is a good callout. Those numbers still seem very slow. One item I'm
> > curious of is if you are dropping the id when you index, or if the id is
> > also being indexed into the druid segments.
> >
> > With how druid does indexing, it dictionary encodes all the dimension
> > values. So the cardinality of rows is a factor of QueryGranularity and
> the
> > cardinality of dimension value tuples per query granularity "bucket".
> This
> > allows dynamic slice and dice on the data. But if it is accidentally
> > including a dimension with very high cardinality (like ID) in the
> > dictionary encoding, then it is not able to make efficient use of
> roll-up.
> >
> > In order to facilitate dynamic slice and dice, the theta sketches need to
> > have *some* kind of object stored per dimension tuple per query
> granularity
> > (but only if the tuple appears in that bucket). So you can reduce the
> > number of things that get read off of disk by trying to increase the
> > rollup. Usually this is done by dropping or reducing high cardinality
> > dimensions, but can also be done by changing the query granularity.
> >
> > Another trick is to use topN or Timeseries. In general, those query types
> > are able to have better optimizations since they have a very
> > limited scope use case.
> >
> > Now, to Theta Sketches itself, I am not as familiar with the Theta
> Sketches
> > code paths. It is possible there are performance gains to be had.
> >
> > Hope this helps,
> > Charles Allen
> >
> >
> > On Fri, Oct 19, 2018 at 3:38 AM alex.rnv...@gmail.com <
> alex.rnv...@gmail.com>
> > wrote:
> >
> > > Hi Druid devs,
> > > I am testing Druid for our specific count distinct estimation case.
> Data
> > > was ingested via Hadoop indexer.
> > > When simplified, it has following schema:
> > > timestamp | key | country | theta-sketch | event-counter
> > > So, there are 2 dimensions, one counter metric, one theta sketch
> metric.
> > > Data granularity is a DAY.
> > > Data source in deep storage is 150-200GB per day.
> > >
> > > I was doing some test runs with our small test cluster (4 Historical
> > > nodes, 8 CPU, 64GB RAM, 500SSD RAM). I admit with this RAM-SSD ratio
> and
> > > number of nodes it is not going to be fast. The question though is in
> > > theta-sketches performance compared to counters aggregation. The
> difference
> > > is an order of magnitude. E.g.: GroupBy query for a single key,
> aggregated
> > > on 7 days:
> > > event-counters - 30 seconds.
> > > theta-sketches -  7 minutes.
> > >
> > > Theta Sketch aggregation implies more work than summing up numbers of
> > > course. But Theta Sketch documentation says that union operation is
> very
> > > fast.
> > >
> > > I did some profiling of one of Historical nodes. Most of CPU time is
> spent
> > > in
> > >
> io.druid.query.aggregation.datasketches.theta.SketchObjectStrategy.fromByteBuffer(ByteBuffer,
> > > int). Which I think is moving Sketch objects from off-heap to managed
> heap.
> > > To be precise, time is spent in sketch library methods
> > > com.yahoo.memory.WritableMemoryImpl.region
> > > com.yahoo.memory.Memory.wrap
> > >
> > > Do not think anything is wrong with this code, except for why is it
> called
> > > so many times.
> > > Which leads to main question. I do not really understand how
> theta-sketch
> > > is stored in columnar database. Assuming it is stored same way as
> counter,
> > > it means that for every combination of "key" and "country" (dimensions
> from
> > > above) - there is a theta sketch structure to be stored. In our case
> "key"
> > > cardinality is quite high. Hence so many Sketch structure accesses in
> Java.
> > > Looks extremely ineffective. Again, it is just an assumption. Please
> excuse
> > > me if am wrong here.
> > >
> > > If you continue thinking in this direction, in terms of performance it
> > > makes sense to store one Theta sketch for 

Re: Druid + Theta Sketches performance

2018-10-19 Thread Charles Allen
This is a good callout. Those numbers still seem very slow. One item I'm
curious of is if you are dropping the id when you index, or if the id is
also being indexed into the druid segments.

With how druid does indexing, it dictionary encodes all the dimension
values. So the cardinality of rows is a function of QueryGranularity and the
cardinality of dimension value tuples per query granularity "bucket". This
allows dynamic slice and dice on the data. But if it is accidentally
including a dimension with very high cardinality (like ID) in the
dictionary encoding, then it is not able to make efficient use of roll-up.

In order to facilitate dynamic slice and dice, the theta sketches need to
have *some* kind of object stored per dimension tuple per query granularity
(but only if the tuple appears in that bucket). So you can reduce the
number of things that get read off of disk by trying to increase the
rollup. Usually this is done by dropping or reducing high cardinality
dimensions, but can also be done by changing the query granularity.
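
To make the roll-up effect concrete, here is a toy sketch in plain Python
(not Druid internals; the event data and granularity below are made up).
After roll-up there is one stored row, and thus one sketch object, per
distinct (time bucket, dimension-value tuple) pair:

```python
def rolled_up_row_count(events, dims, granularity_ms):
    """Count the rows left after roll-up: one per distinct
    (query-granularity bucket, dimension-value tuple)."""
    rows = set()
    for event in events:
        bucket = event["ts"] // granularity_ms
        rows.add((bucket,) + tuple(event[d] for d in dims))
    return len(rows)

# 1000 events in one day, with a unique "id" and a low-cardinality "country".
events = [
    {"ts": t, "id": t, "country": "US" if t % 2 else "DE"}
    for t in range(1000)
]

DAY_MS = 86_400_000
print(rolled_up_row_count(events, ["id", "country"], DAY_MS))  # 1000 rows
print(rolled_up_row_count(events, ["country"], DAY_MS))        # 2 rows
```

Dropping (or coarsening) the high-cardinality dimension is what lets roll-up
collapse rows, which directly reduces how many sketch objects get read per
query.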

Another trick is to use topN or Timeseries. In general, those query types
are able to have better optimizations since they have a very
limited scope use case.

Now, to Theta Sketches itself, I am not as familiar with the Theta Sketches
code paths. It is possible there are performance gains to be had.

Hope this helps,
Charles Allen


On Fri, Oct 19, 2018 at 3:38 AM alex.rnv...@gmail.com 
wrote:

> Hi Druid devs,
> I am testing Druid for our specific count distinct estimation case. Data
> was ingested via Hadoop indexer.
> When simplified, it has following schema:
> timestamp | key | country | theta-sketch | event-counter
> So, there are 2 dimensions, one counter metric, one theta sketch metric.
> Data granularity is a DAY.
> Data source in deep storage is 150-200GB per day.
>
> I was doing some test runs with our small test cluster (4 Historical
> nodes, 8 CPU, 64GB RAM, 500SSD RAM). I admit with this RAM-SSD ratio and
> number of nodes it is not going to be fast. The question though is in
> theta-sketches performance compared to counters aggregation. The difference
> is an order of magnitude. E.g.: GroupBy query for a single key, aggregated
> on 7 days:
> event-counters - 30 seconds.
> theta-sketches -  7 minutes.
>
> Theta Sketch aggregation implies more work than summing up numbers of
> course. But Theta Sketch documentation says that union operation is very
> fast.
>
> I did some profiling of one of Historical nodes. Most of CPU time is spent
> in
> io.druid.query.aggregation.datasketches.theta.SketchObjectStrategy.fromByteBuffer(ByteBuffer,
> int). Which I think is moving Sketch objects from off-heap to managed heap.
> To be precise, time is spent in sketch library methods
> com.yahoo.memory.WritableMemoryImpl.region
> com.yahoo.memory.Memory.wrap
>
> Do not think anything is wrong with this code, except for why is it called
> so many times.
> Which leads to main question. I do not really understand how theta-sketch
> is stored in columnar database. Assuming it is stored same way as counter,
> it means that for every combination of "key" and "country" (dimensions from
> above) - there is a theta sketch structure to be stored. In our case "key"
> cardinality is quite high. Hence so many Sketch structure accesses in Java.
> Looks extremely ineffective. Again, it is just an assumption. Please excuse
> me if am wrong here.
>
> If you continue thinking in this direction, in terms of performance it
> makes sense to store one Theta sketch for every dimension value, so instead
> of having cardinality(key) * cardinality(countries) entries there will be
> cardinality(key) + cardinality(countries) sketches. In this case it looks
> like an index, not a part of columnar storage itself.
> Queries for 2 dimensions are easy, as there is only one INTERSECTION to be
> done. It all looks like a natural thing to do for sketches, as there will
> be a win in terms of storage and query performance.
> My question is if I am right or wrong in my assumptions. If my
> understanding is not correct and sketches are already stored in optimal
> way, could someone give advice on speeding up computations on a single
> Historical node? Otherwise, wanted to ask if there is an attempt or
> discussion to use sketches in the way I described.
> Thanks in advance.
>
>
>


Dev Sync

2018-10-16 Thread Charles Allen
https://meet.google.com/ozi-rtfg-ags

This week's dev sync

Cheers!
Charles Allen


Re: TopN on two metrics

2018-10-11 Thread Charles Allen
yes, you are correct

On Thu, Oct 11, 2018 at 6:24 AM Abhishek Kaushik 
wrote:

> So, something like this should work then:
>
>  "aggregations":[
>   {
>  "fieldName":"metric1",
>  "name":"metric1",
>  "type":"longSum"
>   },
>   {
>  "fieldName":"metric2",
>  "name":"metric2",
>  "type":"longSum"
>   }
>],
>"postAggregations":[
>   {
>  "name":"result",
>  "fn":"+",
>  "type":"arithmetic",
>  "fields":[
> {
>"fieldName":"metric1",
>"name":"metric1",
>    "type":"fieldAccess"
> },
> {
>"fieldName":"metric2",
>"name":"metric2",
>"type":"fieldAccess"
> }
>  ]
>   }
>],
> ...
> "metric": "result"
> ...
> as the config.
>
> On Thu, Oct 11, 2018 at 6:49 PM Charles Allen
>  wrote:
>
> > For the vast majority of use cases, Yes! For example, let's say you have
> two
> > metrics "Cost" and "Taxes" as USD cents. When you add the two together
> you
> > get the total that was charged (or something similar).
> >
> > To get the total that was charged across everything, you simply sum the
> > Cost and sum the Taxes as aggregators, then do what's called a
> > post-aggregator to sum those two aggregators.
> >
> > The place where this DOESN'T work well is if Cost is in local monies, and
> > you want to convert every event to USD and then sum them. The easiest
> way I
> > know to handle such a thing is to do the conversion during your initial
> > data cleanup and processing. An example on a similar vein is if you want
> to
> > do "inflation adjusted USD". For these two scenarios I'd really have to
> > think about if there's a clean way to do the calculation; no immediate
> one
> > comes to mind. In these scenarios the way I can think of would be:
> >
> > A) Do a topN (or groupBy) against the currency type, then do some client
> > side aggregation to convert the per-currency result into a constant
> > currency value
> > B) Do a timeseries, and do the per-time-bucket conversion on the client
> > side, then do the final aggregation on the client side as well.
> >
> > Hopefully that clarifies things a bit.
> >
> >
> > On Thu, Oct 11, 2018 at 6:07 AM Abhishek Kaushik 
> > wrote:
> >
> > > Hi,
> > > Suppose I have two metrics A and B in my dataset. I need to have a TopN
> > > query on the aggregated combination of both (here A+B). Is it possible
> in
> > > druid?
> > >
> >
>


Re: TopN on two metrics

2018-10-11 Thread Charles Allen
For the vast majority of use cases, Yes! For example, let's say you have two
metrics "Cost" and "Taxes" as USD cents. When you add the two together you
get the total that was charged (or something similar).

To get the total that was charged across everything, you simply sum the
Cost and sum the Taxes as aggregators, then do what's called a
post-aggregator to sum those two aggregators.
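
In query JSON, that looks roughly like the following (shown here as a Python
dict; the datasource, dimension, and field names are invented for the
example):

```python
# Sketch of a topN query that ranks by an arithmetic post-aggregation.
# The two longSum aggregators run per dimension value, the "+" post-aggregator
# adds them, and "metric" tells the topN to rank by that sum.
topn_query = {
    "queryType": "topN",
    "dataSource": "sales",            # assumed name
    "dimension": "country",           # assumed name
    "threshold": 10,
    "granularity": "all",
    "intervals": ["2018-01-01/2018-02-01"],
    "aggregations": [
        {"type": "longSum", "name": "cost", "fieldName": "cost"},
        {"type": "longSum", "name": "taxes", "fieldName": "taxes"},
    ],
    "postAggregations": [
        {
            "type": "arithmetic",
            "name": "total_charged",
            "fn": "+",
            "fields": [
                {"type": "fieldAccess", "name": "cost", "fieldName": "cost"},
                {"type": "fieldAccess", "name": "taxes", "fieldName": "taxes"},
            ],
        }
    ],
    "metric": "total_charged",        # rank by the post-aggregated sum
}
```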

The place where this DOESN'T work well is if Cost is in local monies, and
you want to convert every event to USD and then sum them. The easiest way I
know to handle such a thing is to do the conversion during your initial
data cleanup and processing. An example on a similar vein is if you want to
do "inflation adjusted USD". For these two scenarios I'd really have to
think about if there's a clean way to do the calculation; no immediate one
comes to mind. In these scenarios the way I can think of would be:

A) Do a topN (or groupBy) against the currency type, then do some client
side aggregation to convert the per-currency result into a constant
currency value
B) Do a timeseries, and do the per-time-bucket conversion on the client
side, then do the final aggregation on the client side as well.

Hopefully that clarifies things a bit.


On Thu, Oct 11, 2018 at 6:07 AM Abhishek Kaushik 
wrote:

> Hi,
> Suppose I have two metrics A and B in my dataset. I need to have a TopN
> query on the aggregated combination of both (here A+B). Is it possible in
> druid?
>


Re: [EXTERNAL] This week's dev sync

2018-10-09 Thread Charles Allen
great question!

They typically occur at 10am Pacific time every Tuesday.

On Tue, Oct 9, 2018 at 10:59 AM Mohammad.J.Khan 
wrote:

> Hi Charles,
>
> What time does the week's dev sync take place?
>
> Thanks,
> Mohammad
>
> On 10/9/18, 11:55 AM, "Charles Allen" 
> wrote:
>
> https://meet.google.com/ozi-rtfg-ags
>
> Cheers!
>
>
>
>


This week's dev sync

2018-10-09 Thread Charles Allen
https://meet.google.com/ozi-rtfg-ags

Cheers!


Unique Sketch aggregations and bias correction

2018-09-24 Thread Charles Allen
https://github.com/apache/incubator-druid/pull/5712 adds some great
functionality to the Datasketches hooks in Druid.

One thing noted in
https://datasketches.github.io/docs/HLL/HllSketchVsDruidHyperLogLogCollector.html
is the severe bias the druid HLL implementation shows at ~5k uniques being
fed in. This is something we've seen in a severe way internally, where a
bias of a few percent makes a big difference in results. As such, I'm
curious if anyone has done any research into simple bias correction to
attempt to minimize the error seen on the outputs around that cardinality
range?
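
For reference, one simple shape such a correction could take, in the spirit
of HyperLogLog++'s empirical bias correction rather than anything Druid
currently ships: measure the estimator's mean bias at a grid of known
cardinalities offline, then subtract a linearly interpolated bias at query
time. All calibration numbers below are invented for illustration.

```python
import bisect

# Hypothetical offline calibration: mean of (raw_estimate - true_count)
# measured at each known cardinality. These values are made up.
CARDINALITIES = [1_000, 2_000, 5_000, 10_000, 20_000]
MEAN_BIAS     = [   10,    35,   180,    220,    120]

def corrected_estimate(raw):
    """Subtract the interpolated mean bias from a raw estimate."""
    if raw <= CARDINALITIES[0]:
        return raw - MEAN_BIAS[0]
    if raw >= CARDINALITIES[-1]:
        return raw - MEAN_BIAS[-1]
    # Linear interpolation between the two bracketing calibration points.
    i = bisect.bisect_right(CARDINALITIES, raw)
    x0, x1 = CARDINALITIES[i - 1], CARDINALITIES[i]
    b0, b1 = MEAN_BIAS[i - 1], MEAN_BIAS[i]
    bias = b0 + (b1 - b0) * (raw - x0) / (x1 - x0)
    return raw - bias

print(corrected_estimate(5_000))   # 4820.0
```

The hard part is, of course, producing a trustworthy calibration table per
sketch configuration; the interpolation itself is trivial.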

Cheers,
Charles Allen


This week's dev sync

2018-09-11 Thread Charles Allen
https://meet.google.com/ozi-rtfg-ags

Cheers!
Charles Allen


This week's dev sync

2018-09-04 Thread Charles Allen
To join the video meeting, click this link:
https://meet.google.com/ozi-rtfg-ags
Otherwise, to join by phone, dial +1 442-666-1256 and enter this PIN: 6867#
To view more phone numbers, click this link:
https://tel.meet/ozi-rtfg-ags?hs=5

Cheers!


Re: Towards 0.13 (Apache release)

2018-08-30 Thread Charles Allen
I just merged to one of the PRs I'm working on, and after upping the rename
limit to pretty high it was relatively painless

On Thu, Aug 30, 2018 at 10:24 AM Slim Bouguerra 
wrote:

> Thanks Gian worked for me!
>
> > On Aug 30, 2018, at 10:06 AM, Gian Merlino  wrote:
> >
> > That PR is merged now! If anyone here still has outstanding PRs that are
> > now in conflict with master, try running this before merging master, it
> > really helps git out.
> >
> >  git config --local merge.renameLimit 5000
> >
> > My experience was that even a patch with a few dozen changed files merged
> > pretty cleanly, after setting this config. I just had a few conflicts to
> > resolve in imports.
> >
> > On Wed, Aug 29, 2018 at 4:09 PM Gian Merlino  wrote:
> >
> >> I just raised https://github.com/apache/incubator-druid/pull/6266. I
> >> think for sanity's sake, I would really appreciate it if we got this one
> >> merged before merging any other PRs. (It will conflict with 100% of
> other
> >> PRs)
> >>
> >> On Wed, Aug 29, 2018 at 9:34 AM Gian Merlino  wrote:
> >>
> >>> Hi everyone,
> >>>
> >>> As we continue towards 0.13 I started looking into the "great renaming"
> >>> (of all packages from io.druid -> org.apache.druid) and am getting a PR
> >>> ready. I know Slim is working on
> >>> https://github.com/apache/incubator-druid/pull/6215 too (automated
> >>> license checking and some header fixups).
> >>>
> >>> Other than these Apache related items, we have 26 open issues/PRs in
> the
> >>> 0.13.0 milestone:
> https://github.com/apache/incubator-druid/milestone/25.
> >>> Is this everything we want to include? Is anything there we should
> bump to
> >>> the next release? Is anything _not_ there that needs to be added?
> >>>
> >>> Let's figure out when we can target a code freeze -- the start of the
> RC
> >>> train for our first Apache release!!
> >>>
> >>
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> For additional commands, e-mail: dev-h...@druid.apache.org
>
>


dev Sync this week

2018-08-28 Thread Charles Allen
To join the video meeting, click this link:
https://meet.google.com/ozi-rtfg-ags
Otherwise, to join by phone, dial +1 442-666-1256 and enter this PIN: 6867#
To view more phone numbers, click this link:
https://tel.meet/ozi-rtfg-ags?hs=5


September LA Meetup!

2018-08-27 Thread Charles Allen
https://www.meetup.com/druidio-la/events/254080924/

Hello all!

Just fyi, I will be hosting a meetup in Santa Monica, California on
September 20th. I'll discuss ways we are using Druid internally at Snap and
how its strengths fit into some of the big-data strategies.

If you are in the area please stop by for some good tech talks and
networking opportunities!

Best Regards,
Charles Allen


This week's dev sync

2018-08-21 Thread Charles Allen
https://meet.google.com/ozi-rtfg-ags
sorry for the delay


Re: Dev sync this week

2018-08-14 Thread Charles Allen
Oh noes! The call is full. Does anyone know if there's an ASF video or phone
conferencing system that can be used with a higher limit?

On Tue, Aug 14, 2018 at 10:02 AM Jihoon Son  wrote:

> Hi Charles,
>
> I can host today. Here is the link:
> https://hangouts.google.com/call/FrlkHZryggXkzgIOkZCdAAEE.
>
> Best,
> Jihoon
>
> On Tue, Aug 14, 2018 at 9:51 AM Charles Allen
>  wrote:
>
> > I will be late to the dev sync this week and might not be able to make
> it,
> > can anyone else start it?
> >
> > Thank you,
> > Charles Allen
> >
>


Re: Druid 0.12.2 release vote

2018-08-09 Thread Charles Allen
Oh probably. Long story short: approximately equivalent; the new release has
a slightly tighter long tail on the bad side of query time (but also few
samples in the long tail overall).



On Thu, Aug 9, 2018 at 10:37 AM Gian Merlino  wrote:

> Nice!!
>
> Although I don't see the graphic attached, maybe the mailing list ate it?
>
> On Wed, Aug 8, 2018 at 4:15 PM Charles Allen  .invalid>
> wrote:
>
> > Blue is 0.12.2 with some minor backports not perf related. Red is from
> the
> > 0.11.x series. This is effectively a bucketed PDF of the query times for
> a
> > live cluster with Timeseries queries as self-reported by historical
> nodes.
> > I mentioned elsewhere I'm not convinced query/time is a good proxy for
> user
> > experience, but it does provide a good baseline for comparisons between
> > versions. Low query times are suspected due to some aggressive caching or
> > complete node misses (node has very little data for that time range for that
> > datasource). And high query time outliers are often the result of bad GC.
> >
> > On our side there is a new java version going out with the 0.12.2
> > deployment so it is unclear how much is attributed to the new java
> version
> > and how much is attributed to the druid jars or other config changes.
> > Overall things seem to consistently display a small % improvement in the
> > mean with our internal 0.12.2 release. This is good!
> >
> > Cheers,
> > Charles Allen
> >
> > [image: Screen Shot 2018-08-08 at 4.01.24 PM.png]
> >
> >
> > On Wed, Aug 8, 2018 at 3:11 PM David Lim  wrote:
> >
> >> +1, thank you!
> >>
> >> On Wed, Aug 8, 2018 at 3:16 PM Jonathan Wei  wrote:
> >>
> >> > +1, thanks Jihoon!
> >> >
> >> > On Wed, Aug 8, 2018 at 1:18 PM, Jihoon Son 
> >> wrote:
> >> >
> >> > > Awesome! Thanks Charles!
> >> > >
> >> > > Jihoon
> >> > >
> >> > > On Wed, Aug 8, 2018 at 1:16 PM Gian Merlino 
> wrote:
> >> > >
> >> > > > Thanks, it will be nice to see!
> >> > > >
> >> > > > On Wed, Aug 8, 2018 at 1:15 PM Charles Allen <
> >> charles.al...@snap.com
> >> > > > .invalid>
> >> > > > wrote:
> >> > > >
> >> > > > > I don't think it should be a blocker to release, but I have to
> run
> >> > perf
> >> > > > > tests for rollouts anyways so I figured I'd publish what I find
> >> :-P
> >> > > > >
> >> > > > > Cheers,
> >> > > > > Charles Allen
> >> > > > >
> >> > > > >
> >> > > > > On Wed, Aug 8, 2018 at 12:33 PM Gian Merlino 
> >> > wrote:
> >> > > > >
> >> > > > > > That being said, Charles I am definitely looking forward to
> your
> >> > > report
> >> > > > > of
> >> > > > > > what the upgrade from 0.11 -> 0.12.2-rc1 is like in your
> >> cluster!
> >> > > > > >
> >> > > > > > On Wed, Aug 8, 2018 at 12:30 PM Gian Merlino  >
> >> > > wrote:
> >> > > > > >
> >> > > > > > > My thought is that recently we have started doing small
> >> bug-fix
> >> > > > > releases
> >> > > > > > > more often (0.12.1 and 0.12.2 were both small releases) and
> I
> >> > think
> >> > > > it
> >> > > > > > > makes sense to continue this practice. It makes sense to get
> >> them
> >> > > out
> >> > > > > > > quickly, since shipping bug fixes is good. IMO trying to
> >> validate
> >> > > bug
> >> > > > > fix
> >> > > > > > > releases within the customary Apache style 72 hour voting
> >> period
> >> > > is a
> >> > > > > > good
> >> > > > > > > goal.
> >> > > > > > >
> >> > > > > > > On the other hand we do strive to put out high quality
> >> releases,
> >> > > and
> >> > > > we
> >> > > > > > > don't want bug fix releases to introduce regressions.
> Testing
> >> > every
> >> > > > > > single

Re: Druid 0.12.2 release vote

2018-08-08 Thread Charles Allen
Blue is 0.12.2 with some minor backports not perf related. Red is from the
0.11.x series. This is effectively a bucketed PDF of the query times for a
live cluster with Timeseries queries as self-reported by historical nodes.
I mentioned elsewhere I'm not convinced query/time is a good proxy for user
experience, but it does provide a good baseline for comparisons between
versions. Low query times are suspected due to some aggressive caching or
complete node misses (node has very little data for that time range for that
datasource). And high query time outliers are often the result of bad GC.
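As an aside, a "bucketed PDF" of the kind described above can be computed like this. The bucket edges and sample timings are invented for illustration; they are not taken from the cluster in question.

```python
# Hedged sketch: bucket query times into a normalized histogram (a "bucketed
# PDF") so two Druid versions can be overlaid and compared.
def bucketed_pdf(samples_ms, edges):
    counts = [0] * (len(edges) - 1)
    for s in samples_ms:
        for i in range(len(edges) - 1):
            if edges[i] <= s < edges[i + 1]:
                counts[i] += 1
                break
    total = sum(counts) or 1
    return [c / total for c in counts]

edges = [0, 10, 100, 1000, 10000]  # ms bucket boundaries (hypothetical)
old = bucketed_pdf([5, 50, 60, 700, 9000], edges)
new = bucketed_pdf([5, 40, 55, 80, 600], edges)
print(old)  # [0.2, 0.4, 0.2, 0.2]
print(new)  # [0.2, 0.6, 0.2, 0.0]
```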

On our side there is a new java version going out with the 0.12.2
deployment so it is unclear how much is attributed to the new java version
and how much is attributed to the druid jars or other config changes.
Overall things seem to consistently display a small % improvement in the
mean with our internal 0.12.2 release. This is good!

Cheers,
Charles Allen

[image: Screen Shot 2018-08-08 at 4.01.24 PM.png]


On Wed, Aug 8, 2018 at 3:11 PM David Lim  wrote:

> +1, thank you!
>
> On Wed, Aug 8, 2018 at 3:16 PM Jonathan Wei  wrote:
>
> > +1, thanks Jihoon!
> >
> > On Wed, Aug 8, 2018 at 1:18 PM, Jihoon Son  wrote:
> >
> > > Awesome! Thanks Charles!
> > >
> > > Jihoon
> > >
> > > On Wed, Aug 8, 2018 at 1:16 PM Gian Merlino  wrote:
> > >
> > > > Thanks, it will be nice to see!
> > > >
> > > > On Wed, Aug 8, 2018 at 1:15 PM Charles Allen  > > > .invalid>
> > > > wrote:
> > > >
> > > > > I don't think it should be a blocker to release, but I have to run
> > perf
> > > > > tests for rollouts anyways so I figured I'd publish what I find :-P
> > > > >
> > > > > Cheers,
> > > > > Charles Allen
> > > > >
> > > > >
> > > > > On Wed, Aug 8, 2018 at 12:33 PM Gian Merlino 
> > wrote:
> > > > >
> > > > > > That being said, Charles I am definitely looking forward to your
> > > report
> > > > > of
> > > > > > what the upgrade from 0.11 -> 0.12.2-rc1 is like in your cluster!
> > > > > >
> > > > > > On Wed, Aug 8, 2018 at 12:30 PM Gian Merlino 
> > > wrote:
> > > > > >
> > > > > > > My thought is that recently we have started doing small bug-fix
> > > > > releases
> > > > > > > more often (0.12.1 and 0.12.2 were both small releases) and I
> > think
> > > > it
> > > > > > > makes sense to continue this practice. It makes sense to get
> them
> > > out
> > > > > > > quickly, since shipping bug fixes is good. IMO trying to
> validate
> > > bug
> > > > > fix
> > > > > > > releases within the customary Apache style 72 hour voting
> period
> > > is a
> > > > > > good
> > > > > > > goal.
> > > > > > >
> > > > > > > On the other hand we do strive to put out high quality
> releases,
> > > and
> > > > we
> > > > > > > don't want bug fix releases to introduce regressions. Testing
> > every
> > > > > > single
> > > > > > > patch in real clusters is an important part of that. All I can
> do
> > > is
> > > > > > > encourage people running real clusters to deploy RCs as fast as
> > > they
> > > > > can!
> > > > > > > Fwiw, we have already incorporated all the 0.12.2 patches into
> > our
> > > > > Imply
> > > > > > > distro of Druid and already have a good number users running
> > them.
> > > So
> > > > > my
> > > > > > +1
> > > > > > > earlier incorporated knowledge that the patches have been
> > validated
> > > > in
> > > > > > that
> > > > > > > way.
> > > > > > >
> > > > > > > I agree with Jihoon that we will probably end up doing an
> 0.12.3
> > > > soon,
> > > > > to
> > > > > > > fix the issues he mentioned and a couple of others as well.
> > > > > > >
> > > > > > > On Wed, Aug 8, 2018 at 12:07 PM Jihoon Son <
> jihoon...@apache.org
> > >
> > > > > wrote:
> > > > > > >
> > > > > > >> Charles, thank you for doing performance evaluation!

Re: Druid 0.12.2 release vote

2018-08-08 Thread Charles Allen
I don't think it should be a blocker to release, but I have to run perf
tests for rollouts anyways so I figured I'd publish what I find :-P

Cheers,
Charles Allen


On Wed, Aug 8, 2018 at 12:33 PM Gian Merlino  wrote:

> That being said, Charles I am definitely looking forward to your report of
> what the upgrade from 0.11 -> 0.12.2-rc1 is like in your cluster!
>
> On Wed, Aug 8, 2018 at 12:30 PM Gian Merlino  wrote:
>
> > My thought is that recently we have started doing small bug-fix releases
> > more often (0.12.1 and 0.12.2 were both small releases) and I think it
> > makes sense to continue this practice. It makes sense to get them out
> > quickly, since shipping bug fixes is good. IMO trying to validate bug fix
> > releases within the customary Apache style 72 hour voting period is a
> good
> > goal.
> >
> > On the other hand we do strive to put out high quality releases, and we
> > don't want bug fix releases to introduce regressions. Testing every
> single
> > patch in real clusters is an important part of that. All I can do is
> > encourage people running real clusters to deploy RCs as fast as they can!
> > Fwiw, we have already incorporated all the 0.12.2 patches into our Imply
> > distro of Druid and already have a good number users running them. So my
> +1
> > earlier incorporated knowledge that the patches have been validated in
> that
> > way.
> >
> > I agree with Jihoon that we will probably end up doing an 0.12.3 soon, to
> > fix the issues he mentioned and a couple of others as well.
> >
> > On Wed, Aug 8, 2018 at 12:07 PM Jihoon Son  wrote:
> >
> >> Charles, thank you for doing performance evaluation! Performance numbers
> >> are always good and helpful.
> >>
> >> However, IMO, any kind of performance degradation shouldn't be a blocker
> >> for this release. 0.12.2 is a minor release and contains only bug fixes.
> >> https://github.com/apache/incubator-druid/pull/5878/files is the only
> one
> >> tagged with 'Performance', but it can be regarded as a more like a code
> >> bug
> >> rather than architectural performance issue.
> >>
> >> Instead, those kinds of performance tests should be performed per major
> >> release to catch any kinds of unexpected performance change. They can
> be a
> >> blocker if we find any performance regression.
> >>
> >> Also, if you find any performance regression for this release, we
> probably
> >> make another minor release. I think some bug fixes (e.g.,
> >> https://github.com/apache/incubator-druid/issues/6124,
> >> https://github.com/apache/incubator-druid/issues/6123) are also worth
> to
> >> be
> >> included in the minor release.
> >>
> >> What do you think?
> >>
> >> Best,
> >> Jihoon
> >>
> >> On Wed, Aug 8, 2018 at 10:18 AM Charles Allen
> >>  wrote:
> >>
> >> > I'm hoping to have some numbers for any performance changes or other
> >> > impacts in the next few days (rollouts on big clusters take a long
> >> time). I
> >> > am neutral until the numbers come in. Preliminary indicators show no
> >> > significant regression since the 0.11.x series. More data is expected
> >> to be
> >> > available in a few days as rollout completes.
> >> >
> >> >
> >> >
> >> > On Wed, Aug 8, 2018 at 9:10 AM Himanshu  wrote:
> >> >
> >> > > +1 , thanks for coordinating it.
> >> > >
> >> > > On Tue, Aug 7, 2018 at 8:05 PM, Gian Merlino 
> wrote:
> >> > >
> >> > > > +1. Thank you Jihoon for running this release.
> >> > > >
> >> > > > On Tue, Aug 7, 2018 at 10:04 AM Jihoon Son 
> >> > wrote:
> >> > > >
> >> > > > > Sure,
> >> > > > >
> >> > > > > the release note is available here:
> >> > > > > https://github.com/apache/incubator-druid/issues/6116.
> >> > > > >
> >> > > > > Best,
> >> > > > > Jihoon
> >> > > > >
> >> > > > > On Tue, Aug 7, 2018 at 10:02 AM Charles Allen <
> cral...@apache.org
> >> >
> >> > > > wrote:
> >> > > > >
> >> > > > > > ((don't let this ask block the release))
> >> > > > > >
> >> > > > > > Is there a way to get a preview of what the release notice will look like?

Re: Druid 0.12.2 release vote

2018-08-07 Thread Charles Allen
((don't let this ask block the release))

Is there a way to get a preview of what the release notice will look like?

On Mon, Aug 6, 2018 at 3:38 PM Fangjin Yang  wrote:

> +1
>
> On Mon, Aug 6, 2018 at 3:03 PM, Jihoon Son  wrote:
>
> > Hi all,
> >
> > Druid 0.12.2-rc1 (http://druid.io/downloads.html) is available now, and
> I
> > think it's time to vote on the 0.12.2 release. Please note that 0.12.2 is
> > not an ASF release.
> >
> > Here is my +1.
> >
> > Best,
> > Jihoon
> >
>


This week's dev sync

2018-08-07 Thread Charles Allen
https://meet.google.com/ozi-rtfg-ags

See you soon!


Dev Sync today

2018-07-31 Thread Charles Allen
I will likely be late to the dev sync; is anyone else able to start it this
week?


Re: Build failure on 0.13.SNAPSHOT

2018-07-25 Thread Charles Allen
OOME seems to be showing up in some of the Travis testing as well for group
by related stuff. Unsure what's going on there.

On Tue, Jul 24, 2018 at 9:46 PM Dongjin Lee  wrote:

> After some experiments, I figured out the following:
>
> 1. Druid uses above 8gb of memory for testing. (building-druid.png)
> 2. With 8gb(physical)+4gb(swap) of memory, the test succeeds regardless of
> maven version (3.3.9, 3.5.2, 3.5.4) or MAVEN_OPTS. However, with
> 8gb(physical)+2gb(swap) of memory[^1], some tests failed. The list of
> failing tests differs between maven 3.3.9 and 3.5.2.
>
> In short, retaining sufficient memory solved the problem - *It seems like
> 12gb of memory is a recommended setting for building druid.* (I guess
> lots of you are working with the MacBook Pro with 16gb RAM, right? In that
> case, you must not have encountered this problem.)
>
> If you are okay, may I update the building documentation for the newbies
> like me?
>
> Thanks,
> Dongjin
>
> +1. While building Druid, I found another problem. But this issue should
> be discussed in another thread.
>
> [^1]: You know, the other processes also occupy the memory.
>
>
> On Tue, Jul 24, 2018 at 3:07 AM Jihoon Son  wrote:
>
>> I'm also using Maven 3.5.2 and not using any special configurations for
>> Maven, but I have never seen that error too.
>> Most of our Travis jobs have been working with only 512 MB of direct
>> memory. Only the 'strict compilation' Travis job requires 3 GB of memory.
>>
>> I think it's worthwhile to look into this more. Maybe we somehow use more
>> memory when we run all tests by 'mvn install'. Maybe this relates to the
>> frequent transient failures of 'processing module test', one of our Travis
>> jobs.
>>
>> Jihoon
>>
>> On Mon, Jul 23, 2018 at 9:32 AM Gian Merlino  wrote:
>>
>> > Interesting. Fwiw, I am using Maven 3.5.2 for building Druid and it has
> > been working for me. I don't think I'm using any special Maven
>> > overrides (at least, I don't see anything interesting in my ~/.m2
>> directory
>> > or in my environment variables). It might have to do with how much
>> memory
>> > our machines have? I do most of my builds on a Mac with 16GB RAM. Maybe
>> try
>> > checking .travis.yml in the druid repo. It sets -Xmx3000m for mvn
>> install
> > commands, which might be needed in low-memory environments.
>> >
>> > $ mvn --version
>> > Apache Maven 3.5.2 (138edd61fd100ec658bfa2d307c43b76940a5d7d;
>> > 2017-10-18T00:58:13-07:00)
>> > Maven home: /usr/local/Cellar/maven/3.5.2/libexec
>> > Java version: 1.8.0_161, vendor: Oracle Corporation
>> > Java home:
>> > /Library/Java/JavaVirtualMachines/jdk1.8.0_161.jdk/Contents/Home/jre
>> > Default locale: en_US, platform encoding: UTF-8
>> > OS name: "mac os x", version: "10.13.5", arch: "x86_64", family: "mac"
>> >
>> > On Mon, Jul 23, 2018 at 6:40 AM Dongjin Lee  wrote:
>> >
>> > > Finally, it seems like I found the reason. It was a composition of
>> > several
>> > > problems:
>> > >
>> > > - Druid should not be built with maven 3.5.x. With 3.5.2, Test suites
>> > like
>> > > `GroupByQueryRunnerFailureTest` fails. After I switched into 3.3.9
>> which
>> > is
>> > > built in the latest version of IntelliJ, those errors disappeared. It
>> > seems
>> > > like maven 3.5.x is not stable yet - it applied a drastic change, and
>> it
>> > is
>> > > also why they skipped 3.4.x.
>> > > - It seems like Druid requires some MaxDirectMemorySize configuration
>> for
>> > > some test suites. With some JVM parameter like
>> > `-XX:MaxDirectMemorySize=4g`
>> > > some test suites were passed, but not all. I am now trying the other
>> > > options with enlarged swap space.
>> > >
>> > > Question: How much MaxDirectMemorySize configuration are you using?
>> > >
>> > > Best,
>> > > Dongjin
>> > >
>> > > On Sat, Jul 21, 2018 at 3:01 AM Jihoon Son 
>> wrote:
>> > >
>> > > > Hi Dongjin,
>> > > >
>> > > > that is weird. It looks like the vm crashed because of out of memory
>> > > while
>> > > > testing.
>> > > > It might be a real issue or not.
>> > > > Have you set any memory configuration for your maven?
>> > > >
>> > > > Jihoon
>> > > >
>> > > > On Thu, Jul 19, 2018 at 7:09 PM Dongjin Lee 
>> > wrote:
>> > > >
>> > > > > Hi Jihoon,
>> > > > >
>> > > > > I ran `mvn clean package` following development/build
>> > > > > <
>> > > > >
>> > > >
>> > >
>> >
>> https://github.com/apache/incubator-druid/blob/master/docs/content/development/build.md
>> > > > > >
>> > > > > .
>> > > > >
>> > > > > Dongjin
>> > > > >
>> > > > > On Fri, Jul 20, 2018 at 12:30 AM Jihoon Son > >
>> > > > wrote:
>> > > > >
>> > > > > > Hi Dongjin,
>> > > > > >
>> > > > > > what maven command did you run?
>> > > > > >
>> > > > > > Jihoon
>> > > > > >
>> > > > > > On Wed, Jul 18, 2018 at 10:38 PM Dongjin Lee <
>> dong...@apache.org>
>> > > > wrote:
>> > > > > >
>> > > > > > > Hello. I am trying to build druid, but it fails. My
>> environment
>> > is
>> > > > like
>> > > > > > the
>> > > > > > > following:
>> > > > > > >
>> > 

Dev sync

2018-07-24 Thread Charles Allen
Is someone else able to start the dev sync this week? I'm out of town at a
conference.


Re: Multi-threaded Druid Tests/Benchmarks

2018-07-18 Thread Charles Allen
io.druid.benchmark.query.TopNBenchmark is the one that tore up heap when I
was trying to test alternate strategies for
https://github.com/apache/incubator-druid/pull/5913 and
https://github.com/apache/incubator-druid/pull/6014 locally. You can
control the number of segments created.

On Wed, Jul 18, 2018 at 12:35 AM Anastasia Braginsky
 wrote:

>  So this is probably where we can help with the Oak-based incremental
> index.Can you please give me any reference to those tests? Any descriptions?
> Thanks!
>
> On Tuesday, July 17, 2018, 8:59:57 PM GMT+3, Charles Allen <
> charles.al...@snap.com.INVALID> wrote:
>
>  Unfortunately I think multi-threaded test coverage is kind of weak and
> historically very hard to test. There are some topN benchmarks but they are
> very limited as they don't scale well (heap gets blasted from incremental
> index) with a large concurrency level.
>
> On Sun, Jul 15, 2018 at 6:35 AM Anastasia Braginsky
>  wrote:
>
> > Hi Everybody,
> > From last Tuesday Druid's meeting I recall Charles mentioned some Druid's
> > multi-threaded tests/benchmarks that can be applied end-to-end to check
> the
> > performance.
> > Can I get some references/names so I can start investigating this
> > direction from multi-threaded Oak-in-Druid perspective?Thanks!
> >
> >
>


Re: Multi-threaded Druid Tests/Benchmarks

2018-07-17 Thread Charles Allen
Unfortunately I think multi-threaded test coverage is kind of weak and
historically very hard to test. There are some topN benchmarks but they are
very limited as they don't scale well (heap gets blasted from incremental
index) with a large concurrency level.

On Sun, Jul 15, 2018 at 6:35 AM Anastasia Braginsky
 wrote:

> Hi Everybody,
> From last Tuesday Druid's meeting I recall Charles mentioned some Druid's
> multi-threaded tests/benchmarks that can be applied end-to-end to check the
> performance.
> Can I get some references/names so I can start investigating this
> direction from multi-threaded Oak-in-Druid perspective?Thanks!
>
>


TopN folding and result ordering (and maybe group by)

2018-07-17 Thread Charles Allen
I brought this up in the Dev Sync but thought I would write up a couple of
findings here.

We have some large result sets coming back from TopN queries, and have been
looking at optimizations in the TopN (or GroupBy) query path in order to
accommodate these larger result sets returning from many hundreds of nodes.

Looking at the TopN binary apply function
io.druid.query.topn.TopNBinaryFn#apply
<https://github.com/apache/incubator-druid/blob/druid-0.12.1/processing/src/main/java/io/druid/query/topn/TopNBinaryFn.java#L75-L135>
which does the result folding, there is a basic hash-join of the two
results in order to do the fold. This ends up with a lot of hash map
operations for creation and adding entries. I tried some really basic
optimizations to reduce the number of hash map operations in this function,
but they did not result in any measurable improvement in a real environment.

You can see some cpu time flame graphs in
https://github.com/apache/incubator-druid/pull/5913 . Ideally work should
be done in the aggregator combining functions rather than in a whole bunch of
hash map state manipulations.

One potential improvement would be to move from a hash join to a merge
join. But such a scenario would require changing the ordering of the
items so that they can be merge joined. The current ordering is based on
aggregation specification order. This change should allow iterating through
the topn result values only once, and have a simple way to insert new
values in one stream or another into the result topn result value.

This means the query path would sort the results on query time, and shuffle
the result to retain "specification order" only on the last stage out
(during a "finalize" kind of step).

In such a scenario, a potential future optimization would be to allow
results to be streamed back per topn result value. I *think* the current
implementation only considers the timestamp level, meaning if you do an ALL
granularity query, there is only one "chunk" of results that can be
streamed back. I haven't been digging deeply into this aspect though.

Such an optimization should be able to be applied to group by queries as
well, so I don't know if the folks working heavily on the group by queries
have considered this or alternatives.

My question is as follows:

Are there any issues people see for either using the Finalize flag of a
query to determine the sort order, or adding a new query context to
determine if the sort order should be specification order (default) or
lexicographic order (internal override) ?


Thanks,
Charles Allen


Re: This week's dev sync

2018-07-17 Thread Charles Allen
Some on the call mentioned there were some oddities with logins this week.
Using incognito mode worked to fix the login issues.

On Tue, Jul 17, 2018 at 9:50 AM Charles Allen  wrote:

> To join the video meeting, click this link:
> https://meet.google.com/ozi-rtfg-ags
> Otherwise, to join by phone, dial +1 442-666-1256 and
> enter this PIN: 6867#
> To view more phone numbers, click this link:
>
> https://tel.meet/ozi-rtfg-ags?hs=5
>


This week's dev sync

2018-07-17 Thread Charles Allen
To join the video meeting, click this link:
https://meet.google.com/ozi-rtfg-ags
Otherwise, to join by phone, dial +1 442-666-1256 and enter this PIN: 6867#
To view more phone numbers, click this link:
https://tel.meet/ozi-rtfg-ags?hs=5


Java script engine to be removed

2018-07-14 Thread Charles Allen
http://openjdk.java.net/jeps/335

https://bugs.openjdk.java.net/browse/JDK-8202786

The javascript Nashorn engine is deprecated and slated to be removed in the
next long term support release of Java.

https://github.com/apache/incubator-druid/issues/5589 is the ticket for
maintaining future java support in Druid.

Not quite sure what the best way forward is. Should we revert to Rhino
https://developer.mozilla.org/en-US/docs/Mozilla/Projects/Rhino ?


Re: Druid 0.12.2-rc1 vote

2018-07-10 Thread Charles Allen
Brought this up in the dev sync:

I saw a lot of PRs and fixes for Coordinator segment balancing related to
some regressions that happened in 0.12.x. Is anyone able to give a rundown
of the state of coordinator segment management for the 0.12.2 RC?

On Tue, Jul 10, 2018 at 10:26 AM Nishant Bangarwa 
wrote:

> +1
>
> --
> Nishant Bangarwa
>
> Hortonworks
>
> On 7/10/18, 3:57 AM, "Jihoon Son"  wrote:
>
> Related thread:
>
> https://lists.apache.org/thread.html/76755aecfddb1210fcc3f08b1d4631784a8a5eede64d22718c271841@%3Cdev.druid.apache.org%3E
> .
>
> Jihoon
>
> On Mon, Jul 9, 2018 at 3:25 PM Jihoon Son 
> wrote:
>
> > Hi all,
> >
> > We have no open issues and PRs for 0.12.2 (
> > https://github.com/apache/incubator-druid/milestone/27). The 0.12.2
> > branch is already available and all PRs for 0.12.2 have merged into
> that
> > branch.
> >
> > Let's vote on releasing RC1. Here is my +1.
> >
> > This is a non-ASF release.
> >
> > Best,
> > Jihoon
> >
>
>
>


This week's dev sync

2018-07-10 Thread Charles Allen
https://meet.google.com/ozi-rtfg-ags


Sync this week

2018-07-03 Thread Charles Allen
I am indisposed this week. Can anyone else start the sync up?


This week's dev sync

2018-06-12 Thread Charles Allen
To join the video meeting, click this link:
https://meet.google.com/ozi-rtfg-ags
Otherwise, to join by phone, dial +1 442-666-1256 and enter this PIN: 6867#
To view more phone numbers, click this link:
https://tel.meet/ozi-rtfg-ags?hs=5


Re: Update log4j version to >= 2.8

2018-06-12 Thread Charles Allen
This would actually be helpful as more recent log4j json formats have
better and more controllable options. This helps when druid stdout/stderr
logs get pumped directly into a rich logging system (like stackdriver or
splunk or sumologic). But there are a few changes in tests that need to be
fixed to accommodate newer versions. And if anyone is running custom log4j
plugins (pretty common) it can affect that as well.

I'm +1 for upgrading log4j though

On Tue, Jun 12, 2018 at 7:24 AM  wrote:

> Hi,
> We realized on our production cluster that we're missing log files. After
> some investigation it seems that old versions of log4j (< 2.8) have a maximum
> number of files set to 7 for the default rollover strategy:
> http://apache-logging.6191.n7.nabble.com/Max-index-limit-in-DefaultRolloverStrategy-td75592.html.
> From version 2.8 this default was removed (
> https://logging.apache.org/log4j/2.x/manual/appenders.html#RollingFileAppender).
> Would it have a lot of impact to update log4j to a more recent version?
> Cheers
> Abson
>


New Druid Meetup group for LA / Venice / Santa Monica!

2018-06-09 Thread Charles Allen
Hello all,

I spawned up a meetup for the LA area at https://www.meetup.com/druidio-la/ .
The reason for a different meetup is so the location stuff at meetup.com
works correctly (compared to https://www.meetup.com/druidio ). If you are
interested in keeping in touch with other analytics lovers in the LA area
please feel free to join!

Regards,
Charles Allen


Committers

2018-06-09 Thread Charles Allen
Hi all,

I'm looking at https://github.com/orgs/apache/teams/apache-committers and
wondering if there's a way to give them rights in the druid-io GitHub org
until we get the code moved over to the official ASF git. Does anyone
happen to know a way to add external teams to a different org's team or
collaborator structure in GitHub?

Thank you,
Charles Allen


Dev sync this week:

2018-06-05 Thread Charles Allen
https://meet.google.com/ozi-rtfg-ags

Cheers!
Charles Allen


Re: Apache project name for Druid

2018-06-04 Thread Charles Allen
http://tsdr.uspto.gov/#caseNumber=85681881&caseType=SERIAL_NO&searchType=statusSearch
Just for note, this is not expected to be a concern.

+1


On Mon, Jun 4, 2018 at 3:40 PM Jihoon Son  wrote:

> +1
>
> On Mon, Jun 4, 2018 at 2:54 PM Xavier Léauté  wrote:
>
> > +1
> >
> > On Mon, Jun 4, 2018 at 2:49 PM Maxime Beauchemin <
> > maximebeauche...@gmail.com>
> > wrote:
> >
> > > +1
> > >
> > > On Mon, Jun 4, 2018 at 2:49 PM Gian Merlino  wrote:
> > >
> > > > +1 for keeping the name 'Druid'
> > > >
> > > > On Mon, Jun 4, 2018 at 1:10 PM, Atul Mohan 
> > > > wrote:
> > > >
> > > > > Hello All,
> > > > >
> > > > > I'm planning to get started with one of the Apache incubation task
> > > items
> > > > to
> > > > > ensure the project name does not already exist (
> > > > > https://github.com/druid-io/druid/issues/5823 ).
> > > > > Before I create a JIRA against the PODLINGNAMESEARCH JIRA group, I
> > just
> > > > > wanted to confirm that the project name that we need is: *Apache
> > > Druid*.
> > > > > Could the committers please do a quick vote on this name so that
> once
> > > we
> > > > > have a majority, I can go ahead with the JIRA?
> > > > >
> > > > > Thanks,
> > > > > --
> > > > > Atul Mohan
> > > > > 
> > > > >
> > > >
> > >
> >
>


Re: Apache Incubation task items

2018-06-01 Thread Charles Allen
Thank you Atul!

The items are open to anyone who has the capacity to help with them. There
are a few items, like Intellectual Property assignments, that are hard for
the general community to assist with, but there are many others where help
would be great.

Cheers,
Charles Allen

On Thu, May 31, 2018 at 11:28 AM Atul Mohan  wrote:

> Hello Charles,
> Going through the item list, I just had a quick question. Are all these
> tasks meant to be taken up only by the committers? If there are tasks which
> can be completed by contributors, I would be happy to help.
>
> Thanks,
> Atul
>
> On Thu, May 31, 2018 at 12:42 PM, Charles Allen 
> wrote:
>
> > https://github.com/druid-io/druid/projects/3 is a list of all the items
> in
> > http://incubator.apache.org/projects/druid.html
> >
> > We will need help getting these resourced and completed. For a thing to
> be
> > completed and closed, the page at
> > http://incubator.apache.org/projects/druid.html needs to be updated with any
> > relevant information.
> >
> > I have also created a new label
> > https://github.com/druid-io/druid/issues?q=is%3Aissue+is%
> > 3Aopen+label%3AApache
> > for
> > any issues related to being a part of ASF, not specifically related to
> the
> > Druid code itself.
> >
> > The kanban board is in no specific order, so please do not take the
> > relative order or issue number as any sort of indicator.
> >
> > Thank you all for your assistance as we go along this exciting path!
> >
> > Cheers,
> > Charles Allen
> >
>
>
>
> --
> Atul Mohan
> 
>


Apache Incubation task items

2018-05-31 Thread Charles Allen
https://github.com/druid-io/druid/projects/3 is a list of all the items in
http://incubator.apache.org/projects/druid.html

We will need help getting these resourced and completed. For a thing to be
completed and closed, the page at
http://incubator.apache.org/projects/druid.html needs to be updated with any
relevant information.

I have also created a new label
https://github.com/druid-io/druid/issues?q=is%3Aissue+is%3Aopen+label%3AApache
for
any issues related to being a part of ASF, not specifically related to the
Druid code itself.

The kanban board is in no specific order, so please do not take the
relative order or issue number as any sort of indicator.

Thank you all for your assistance as we go along this exciting path!

Cheers,
Charles Allen


Re: Access to jira

2018-05-31 Thread Charles Allen
Sounds good. I'd like to add some more formal tracking and responsibility
around the remaining incubator items. Would github issues be the preferred
place to do that?

On Thu, May 31, 2018 at 9:20 AM Gian Merlino  wrote:

> I think we are planning to keep using GitHub issues, based on the
> discussion in the migration logistics thread. And based on the fact that
> Apache seems to allow that now (https://github.com/apache/fluo was given
> as
> an example). So probably the right thing to do is update
> http://incubator.apache.org/projects/druid.html accordingly?
>
> On Thu, May 31, 2018 at 9:15 AM, Charles Allen  wrote:
>
> > Hi all
> >
> > http://incubator.apache.org/projects/druid.html says that
> > https://issues.apache.org/jira/browse/DRUID is our issue tracker, but I
> > don't seem to have access to it. Does anyone know how to apply for access
> > using an existing Apache JIRA login?
> >
> > Thanks,
> > Charles Allen
> >
>


Access to jira

2018-05-31 Thread Charles Allen
Hi all

http://incubator.apache.org/projects/druid.html says that
https://issues.apache.org/jira/browse/DRUID is our issue tracker, but I
don't seem to have access to it. Does anyone know how to apply for access
using an existing Apache JIRA login?

Thanks,
Charles Allen


Re: Druid 0.12.1 release vote

2018-05-29 Thread Charles Allen
+1

On Tue, May 29, 2018 at 11:04 AM Himanshu  wrote:

> +1
>
> On Tue, May 29, 2018 at 11:02 AM, Jonathan Wei  wrote:
>
> > As discussed on the sync up call this morning, let's vote on the 0.12.1
> > release.
> >
> > Thanks,
> > Jon
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> > For additional commands, e-mail: dev-h...@druid.apache.org
> >
> >
>


This week's dev sync

2018-05-22 Thread Charles Allen
https://meet.google.com/ozi-rtfg-ags


Sync up this week

2018-05-15 Thread Charles Allen
https://meet.google.com/ozi-rtfg-ags


This week's dev sync

2018-05-08 Thread Charles Allen
https://meet.google.com/ozi-rtfg-ags


Sync-up today

2018-04-24 Thread Charles Allen
Is someone else able to start the sync-up today? I will be unable to make
it again.

Thank you,
Charles Allen


This week's sync up

2018-04-10 Thread Charles Allen
To join the video meeting, click this link:
https://meet.google.com/ozi-rtfg-ags
Otherwise, to join by phone, dial +1 442-666-1256 and enter this PIN: 6867#
To view more phone numbers, click this link:
https://tel.meet/ozi-rtfg-ags?hs=5


This week's Dev Sync

2018-04-03 Thread Charles Allen
To join the video meeting, click this link:
https://meet.google.com/ozi-rtfg-ags
Otherwise, to join by phone, dial +1 442-666-1256 and enter this PIN: 6867#
To view more phone numbers, click this link:
https://tel.meet/ozi-rtfg-ags?hs=5


RE: [druid-dev] any reason to still keep overlord as separate node?

2018-03-31 Thread Charles Allen
I am one of those odd cases and agree with Gian on both counts.

-Original Message-
From: Gian Merlino  
Sent: Saturday, March 31, 2018 11:15 AM
To: druid-developm...@googlegroups.com; dev@druid.apache.org
Subject: Re: [druid-dev] any reason to still keep overlord as separate node?

Hi Prashant,

The only issue that I can think of is that in some (super large) clusters, the 
coordinator and overlord can both be pretty demanding in terms of memory and it 
helps for scalability to have them be separate. But this is not the common case 
- most clusters are smaller or medium sized. So it makes sense for the default 
to be combining them. I would support a patch that changed the defaults and 
updated the docs accordingly.
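
For reference, the combined mode discussed above is controlled by a coordinator
runtime property. A minimal sketch, assuming the `druid.coordinator.asOverlord.*`
properties present in Druid releases of this era (verify the exact names against
your version's configuration reference):

```properties
# coordinator runtime.properties (sketch, not from this thread)
# Run the overlord inside the coordinator process:
druid.coordinator.asOverlord.enabled=true
# Service name the combined process announces for overlord duties:
druid.coordinator.asOverlord.overlordService=druid/overlord
```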

Btw, since we are trying to migrate the dev mailing list to Apache, please 
cross post this sort of thing with dev@druid.apache.org, or even only post to 
that list.

Gian

On Sat, Mar 31, 2018 at 9:42 AM, Prashant Deva 
wrote:

> I feel at least the documentation should be written to assume that
> overlord+coordinator is the default config and a separate overlord is the
> 'legacy' one.
>
> are there any actual issues with keeping the overlord as a separate node?
>
> --
> You received this message because you are subscribed to the Google 
> Groups "Druid Development" group.
> To unsubscribe from this group and stop receiving emails from it, send 
> an email to druid-development+unsubscr...@googlegroups.com.
> To post to this group, send email to druid-developm...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/druid-development/2d22212b-23dc-4654-9b48-df8439cb62ad%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>




RE: Roll call?

2018-03-25 Thread Charles Allen
It's been called out in the weekly sync up and sent to the mailing lists. I 
think it is safe to move things here in the sense that there’s not much else we 
can do for anyone who may have not moved over yet.



Cheers,
Charles Allen

From: Julian Hyde <jh...@apache.org>
Sent: Saturday, March 24, 2018 5:29:44 PM
To: dev@druid.apache.org
Subject: Roll call?

Do we believe that all PPMC members (initial committers and mentors) have now 
joined the dev and private lists?

(As a moderator of the mailing lists, I ought to know how to check who is on 
the list, but I don't. Maybe we should have a roll call.)

If everyone is on the lists, I think we should move all business to the Apache 
lists.

Julian

