Re: Hive MATCHPATH

2018-05-07 Thread Alexander Behm
No, I'm afraid not.

On Mon, May 7, 2018 at 4:41 PM, Alexandra Rodoni 
wrote:

> Does Impala support an equivalent for the Hive matchpath (formerly known as
> npath) UDTF?
>


Re: Unsupported major.minor version 52.0

2018-04-26 Thread Alexander Behm
Are you running a Java8 JDK?

On Thu, Apr 26, 2018 at 9:35 AM, Jim Apple  wrote:

> testdata/bin/run-all.sh fails for me with the following error
>
> Exception in thread "main" java.lang.UnsupportedClassVersionError:
> org/apache/hadoop/hdfs/server/namenode/NameNode : Unsupported major.minor
> version 52.0
>
> This is at gerrit HEAD, at tree hash
>
> 6e88e1b26423badb4012506c10a98f32f600dbdb IMPALA-6892:
> CheckHashAndDecrypt()
> includes file and host
>
> aka commit
>
> 518bcd3e148caa8b42011de11e971c2978fb6f3b
>
> This occurs even after a successful buildall.sh
>
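The "Unsupported major.minor version 52.0" error means the class file requires Java 8 (class-file major version 52) but an older JVM is running it. As an illustrative sketch (the helper below is my own, not part of Impala or Hadoop), the required version can be read straight from a .class file's 8-byte header:

```python
import struct

# Class-file major version -> minimum JDK (standard mapping since early JDKs)
JDK_BY_MAJOR = {50: "Java 6", 51: "Java 7", 52: "Java 8", 53: "Java 9"}

def class_file_jdk(data: bytes) -> str:
    """Return the JDK that a .class file's header requires."""
    magic, _minor, major = struct.unpack(">IHH", data[:8])
    assert magic == 0xCAFEBABE, "not a class file"
    return JDK_BY_MAJOR.get(major, f"major version {major}")

# A header compiled for Java 8 (major version 52 = 0x34):
header = b"\xca\xfe\xba\xbe\x00\x00\x00\x34"
assert class_file_jdk(header) == "Java 8"
```

So a NameNode jar reporting major version 52 needs JAVA_HOME pointing at a Java 8 (or newer) JDK.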


Re: Bulk Cherry-Pick FGP Commits to 2.x Branch

2018-04-25 Thread Alexander Behm
I confirm that I checked the code changes and test results.

On Wed, Apr 25, 2018 at 1:17 PM, Bharath Vissapragada  wrote:

> Cool, thanks.
>
> On Wed, Apr 25, 2018 at 1:14 PM, Fredy Wijaya 
> wrote:
>
> > Yeah "Fine-grained Privileges".
> >
> >
> > On Wed, Apr 25, 2018 at 3:13 PM, Bharath Vissapragada <
> > bhara...@cloudera.com
> > > wrote:
> >
> > > FGP = "Fine-grained privileges" or something else?
> > >
> > > On Wed, Apr 25, 2018 at 1:09 PM, Fredy Wijaya 
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > Just to give a heads-up, I'll do bulk cherry-pick FGP commits to 2.x
> > > > branch. Build is passing in the private build and Alex has verified
> it.
> > > >
> > > > Let me know if you have any concerns.
> > > >
> > > > Thanks,
> > > > Fredy
> > > >
> > >
> >
>


Re: JIRA necromancy

2018-04-17 Thread Alexander Behm
I agree. Reopening can be very confusing.

On Tue, Apr 17, 2018 at 11:20 AM, Jim Apple  wrote:

> I'm convinced.
>
> On Tue, Apr 17, 2018 at 10:29 AM, Tim Armstrong 
> wrote:
> > I noticed that there's been a trend recently towards reopening old issues
> > instead of filing new issues. Not trying to pick on anyone but it seems
> > like it's worth having a discussion about best practices.
> >
> > Personally I think reopening JIRAs is often a bad thing for several
> > reasons:
> >
> > * We don't tend to properly triage the issue to determine if it
> actually
> > has the same root cause as the old one. E.g. the same test fails for two
> > completely different reasons.
> > * People are tempted to skimp on including diagnostic information.
> > * It gets confusing trying to figure out which version the issue was
> fixed
> > in, particularly if the new thing turns out to be a separate issue.
> > * The target version, fix version, priority, etc. are wrong
> > * It automatically ends up on the plate of whoever last fixed it, rather
> > than whoever currently has bandwidth. This is particularly bad for anyone
> > who has fixed or tried to fix a lot of flaky tests over the last year or
> > two (e.g. me).
> >
> > I'd prefer if we opened new issues by default unless we're really
> confident
> > that it's the same issue. It's much easier to mark issues as duplicates
> > than it is to separate out two distinct issues tracked by one JIRA. Even
> if
> > we're pretty sure it's the same thing, I think we should think carefully
> > before re-opening issues from previous releases.
> >
> > Anyway, this is just my opinion. Do others agree or disagree?
> >
> > - Tim
>


Re: New Impala committer

2018-04-09 Thread Alexander Behm
Congratulations, Vuk!

On Mon, Apr 9, 2018 at 4:56 PM, Dimitris Tsirogiannis <
dtsirogian...@cloudera.com> wrote:

> The Project Management Committee (PMC) for Apache Impala has invited Vuk
> Ercegovac to become a committer and we are pleased to announce that they
> have accepted.
>
> Congratulations and welcome, Vuk!
>


Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-20 Thread Alexander Behm
Today, Impala does not evaluate "<col> != <constant>" against stats, but as
Zoltan pointed out there is a way to reasonably do that. It does not work
if we ignore NaN though, so we need to be careful.

On Tue, Feb 20, 2018 at 9:24 AM, Zoltan Ivanfi <z...@cloudera.com> wrote:

> In parquet-mr, if you are looking for a value that is not equal to some
> reference value r and stats are min = r and max = r then that row group is
> discarded, because there can not be any other values in that row group.
>
> On Tue, Feb 20, 2018 at 6:21 PM Jim Apple <jbap...@cloudera.com> wrote:
>
> > For that predicate in particular, does Impala use stats already?
> >
> > Let's say a column contains only the intuitive notion of floats: no
> > NaNs, no infs, no -0.0. If we are filtering for $COL != a and the
> > row-group stats are b <= $COL <= c, were a < b, we can know that the
> > whole row group can be included. The addition of NaNs doesn't change
> > that.
> >
> > OTOH, if b <= a <= c, then we have to check the whole row group, and
> > the addition of NaNs doesn't change that.
> >
> > On Tue, Feb 20, 2018 at 9:14 AM, Alexander Behm <alex.b...@cloudera.com>
> > wrote:
> > > On Mon, Feb 19, 2018 at 8:04 AM, Zoltan Ivanfi <z...@cloudera.com>
> wrote:
> > >
> > >> Hi,
> > >>
> > >> Tim, I added your suggestion to introduce a new ColumnOrder to
> > PARQUET-1222
> > >> <https://issues.apache.org/jira/browse/PARQUET-1222> as the preferred
> > >> solution.
> > >>
> > >> Alex, not writing min/max if there is a NaN is indeed a feasible
> > quick-fix,
> > >> but I think it would be better to just ignore NaN-s for the purposes
> of
> > >> min/max stats. For reading, we can ignore stats that contain a NaN. We
> > also
> > >> shouldn't use stats when looking for a NaN. -0 and +0 will still be
> > >> problematic, though.
> > >>
> > >
> > > I don't think ignoring NaNs is correct. Consider a predicate <col>
> > > != <constant> that would evaluate to true against NaN. We cannot
> > > reliably use stats for such a predicate.
> > >
> > >
> > >>
> > >> Jim, fmax is indeed very close to IEEE-754's maxNum, but -0 and +0 are
> > >> implementation-dependent, as Zoltan Borok-Nagy pointed out to me:
> > "This
> > >> function is not required to be sensitive to the sign of zero, although
> > some
> > >> implementations additionally enforce that if one argument is +0 and
> the
> > >> other is -0, then +0 is returned." [1
> > >> <http://en.cppreference.com/w/c/numeric/math/fmax>]
> > >>
> > >> Br,
> > >>
> > >> Zoltan
> > >>
> > >>
> > >>
> > >> On Fri, Feb 16, 2018 at 6:57 PM Jim Apple <jbap...@cloudera.com>
> wrote:
> > >>
> > >> > On Fri, Feb 16, 2018 at 9:44 AM, Zoltan Borok-Nagy
> > >> > <borokna...@cloudera.com> wrote:
> > >> > > I would just like to mention that the fmax() / fmin() functions in
> > >> C/C++
> > >> > > Math library follow the aforementioned IEEE 754-2008 min and max
> > >> > > specification:
> > >> > > http://en.cppreference.com/w/c/numeric/math/fmax
> > >> > >
> > >> > > I think this behavior is also the most intuitive and useful with
> > regard
> > >> to
> > >> > > statistics. If we want to select the max value, I think it's
> > reasonable
> > >> > to
> > >> > > ignore nulls and not-numbers.
> > >> >
> > >> > It should be noted that this is different than the total ordering
> > >> > predicate. With that predicate, -NaN < -inf < negative numbers <
> -0.0
> > >> > < +0.0 < positive numbers < +inf < +NaN
> > >> >
> > >> > fmax appears to be closest to IEEE-754's maxNum, but not quite
> > >> > matching for some corner cases (-0.0, signalling NaN), but I'm not
> > >> > 100% sure on that.
> > >> >
> > >>
> >
>


Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-20 Thread Alexander Behm
On Mon, Feb 19, 2018 at 8:04 AM, Zoltan Ivanfi  wrote:

> Hi,
>
> Tim, I added your suggestion to introduce a new ColumnOrder to PARQUET-1222
>  as the preferred
> solution.
>
> Alex, not writing min/max if there is a NaN is indeed a feasible quick-fix,
> but I think it would be better to just ignore NaN-s for the purposes of
> min/max stats. For reading, we can ignore stats that contain a NaN. We also
> shouldn't use stats when looking for a NaN. -0 and +0 will still be
> problematic, though.
>

I don't think ignoring NaNs is correct. Consider a predicate <col> !=
<constant> that would evaluate to true against NaN. We cannot reliably use
stats for such a predicate.
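A small sketch of that point (my own helper names, not Impala code): pruning a row group for "<col> != <ref>" is only safe when every value in the group equals the reference, and NaN-ignoring stats can silently violate that.

```python
import math

def can_prune_not_equal(min_val, max_val, ref, may_contain_nan):
    """Row-group pruning for '<col> != ref': the group can be skipped only
    when every value equals ref (min == max == ref) AND no NaN can be present,
    since NaN != ref evaluates to true and would match the predicate."""
    if may_contain_nan:
        return False
    return min_val == ref and max_val == ref

# Stats computed while ignoring NaN report min = max = 1.0 for this data:
values = [1.0, math.nan, 1.0]
non_nan = [v for v in values if not math.isnan(v)]
assert (min(non_nan), max(non_nan)) == (1.0, 1.0)

# But the predicate 'col != 1.0' does match a row (the NaN one):
assert any(v != 1.0 for v in values)

# So pruning must be suppressed unless NaN absence is known:
assert can_prune_not_equal(1.0, 1.0, 1.0, may_contain_nan=True) is False
assert can_prune_not_equal(1.0, 1.0, 1.0, may_contain_nan=False) is True
```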


>
> Jim, fmax is indeed very close to IEEE-754's maxNum, but -0 and +0 are
> implementation-dependent, as Zoltan Borok-Nagy pointed out to me: "This
> function is not required to be sensitive to the sign of zero, although some
> implementations additionally enforce that if one argument is +0 and the
> other is -0, then +0 is returned." [1
> ]
>
> Br,
>
> Zoltan
>
>
>
> On Fri, Feb 16, 2018 at 6:57 PM Jim Apple  wrote:
>
> > On Fri, Feb 16, 2018 at 9:44 AM, Zoltan Borok-Nagy
> >  wrote:
> > > I would just like to mention that the fmax() / fmin() functions in
> C/C++
> > > Math library follow the aforementioned IEEE 754-2008 min and max
> > > specification:
> > > http://en.cppreference.com/w/c/numeric/math/fmax
> > >
> > > I think this behavior is also the most intuitive and useful with regard
> to
> > > statistics. If we want to select the max value, I think it's reasonable
> > to
> > > ignore nulls and not-numbers.
> >
> > It should be noted that this is different than the total ordering
> > predicate. With that predicate, -NaN < -inf < negative numbers < -0.0
> > < +0.0 < positive numbers < +inf < +NaN
> >
> > fmax appears to be closest to IEEE-754's maxNum, but not quite
> > matching for some corner cases (-0.0, signalling NaN), but I'm not
> > 100% sure on that.
> >
>


Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-16 Thread Alexander Behm
On Fri, Feb 16, 2018 at 9:38 AM, Tim Armstrong <tarmstr...@cloudera.com>
wrote:

> The reader still can't correctly interpret those stats without knowing
> about the behaviour of that specific writer though, because it can't assume
> the absence of NaNs unless it knows that it is reading a file written by
> a writer that drops stats when it sees NaNs.
>
> It *could* fix the behaviour of some naive readers that don't correctly
> handle the current ambiguity in the specification, but I think those need
> to be fixed anyway because they will return wrong results for existing
> files.
>
> In the process of fixing the readers, you could then modify the readers so
> that they are aware of this special writer that drops stats with NaNs and
> knows that it is safe to use them, but I think those kind of shared
> reader-writer assumptions are essentially like having an unofficial
> extension of the Parquet spec.
>

Good point. I agree that summarizes the issues.

It is basically treating this issue like a bug in the writer (arguable). I
don't think we should require a rev in the Parquet spec to address bugs in
writers.
Working around bugs will always require checking the writer version in all
readers.

I'm certainly on board with the cleaner solution to adjust the spec - but
writers will always have bugs.

>
> On Fri, Feb 16, 2018 at 9:20 AM, Lars Volker <l...@cloudera.com> wrote:
>
> > Yeah, I missed that. We set it per column, so all other types could keep
> > TypeDefinedOrder and floats could have something like
> NanAwareDoubleOrder.
> >
> > On Fri, Feb 16, 2018 at 9:18 AM, Tim Armstrong <tarmstr...@cloudera.com>
> > wrote:
> >
> > > We wouldn't need to rev the whole TypeDefinedOrder thing right?
> Couldn't
> > we
> > > just define a special order for floats? Essentially it would be a tag
> for
> > > writers to say "hey I know about this total order thing".
> > >
> > > On Fri, Feb 16, 2018 at 9:14 AM, Lars Volker <l...@cloudera.com> wrote:
> > >
> > > > I think one idea behind the column order fields was that if a reader
> > does
> > > > not recognize a value there, it needs to ignore the stats. If I
> > remember
> > > > correctly, that was intended to allow us to add new orderings for
> > > > collations, but it also seems useful to address gaps in the spec or
> > known
> > > > broken readers. In this case we would need to deprecate the default
> > > > "TypeDefinedOrder" and replace it with something like
> > > > "TypeDefinedOrderWithCorrectOrderingForDoubles". We could also count
> > up,
> > > > like TypeDefinedOrderV2 and so on.
> > > >
> > > > An alternative would be to list all writers that are known to have
> > > written
> > > > incorrect stats. However that will not prevent old implementations to
> > > > misinterpret correct stats - which I think was the main reason why we
> > > added
> > > > new stats fields.
> > > >
> > > >
> > > >
> > > > On Fri, Feb 16, 2018 at 9:03 AM, Alexander Behm <
> > alex.b...@cloudera.com>
> > > > wrote:
> > > >
> > > > > I hope the common case is that data files do not contain these
> > special
> > > > > float values. As the simplest solution, how about writers refrain
> > from
> > > > > populating the stats if a special value is encountered?
> > > > >
> > > > > That fix does not preclude a more thorough solution in the future,
> > but
> > > it
> > > > > addresses the common case quickly.
> > > > >
> > > > > For existing data files we could check the writer version and ignore
> > > filters
> > > > on
> > > > > float/double. I don't know whether min/max filtering is common on
> > > > > float/double, but I suspect it's not.
> > > > >
> > > > > On Fri, Feb 16, 2018 at 8:38 AM, Tim Armstrong <
> > > tarmstr...@cloudera.com>
> > > > > wrote:
> > > > >
> > > > > > There is an extensibility mechanism with the ColumnOrder union -
> I
> > > > think
> > > > > > that was meant to avoid the need to add new stat fields?
> > > > > >
> > > > > > Given that the bug was in the Parquet spec, we'll need to make a
> > spec
> > > > > > change anyway, so we could add a new ColumnOrder -
> > > > > FloatingPointTotalOrder?
> > > > > at the same time as fixing the gap in the spec.


Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-16 Thread Alexander Behm
On Fri, Feb 16, 2018 at 9:15 AM, Tim Armstrong <tarmstr...@cloudera.com>
wrote:

> I don't see a major benefit to a temporary solution. The files are already
> out there and we need to implement a fix on the read path regardless. If we
> keep writing the stats there's at least some information contained in the
> stats that readers can make use of, if they want to implement the required
> logic.
>
> Dropping stats if an NaN is encountered also doesn't really address the
> other side of the problem - an absence of a NaN in the stats doesn't imply
> an absence of a NaN in the data, so the reader can't do anything useful
> with the stats anyway unless it's NaN-aware.
>

The writer solution is to only write stats if the data does not contain
special values (the common case).
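A minimal writer-side sketch of that policy, under stated assumptions (the function and its treatment of NaN and -0.0 as "special" are illustrative, not the actual Parquet writer):

```python
import math

def is_special(v: float) -> bool:
    """NaN breaks min/max comparison semantics; -0.0 is ambiguous because
    -0.0 == +0.0 under ordinary comparison but their bit patterns differ."""
    return math.isnan(v) or (v == 0.0 and math.copysign(1.0, v) < 0.0)

def column_min_max(values):
    """Emit min/max stats only when no special float values are present;
    otherwise drop the stats and force readers to scan the row group."""
    if any(is_special(v) for v in values):
        return None
    return (min(values), max(values))

assert column_min_max([1.0, 2.5, -3.0]) == (-3.0, 2.5)
assert column_min_max([1.0, math.nan]) is None   # NaN drops the stats
assert column_min_max([-0.0, 1.0]) is None       # signed zero is ambiguous
assert column_min_max([0.0, 1.0]) == (0.0, 1.0)  # plain +0.0 is fine
```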

>
> On Fri, Feb 16, 2018 at 9:03 AM, Alexander Behm <alex.b...@cloudera.com>
> wrote:
>
> > I hope the common case is that data files do not contain these special
> > float values. As the simplest solution, how about writers refrain from
> > populating the stats if a special value is encountered?
> >
> > That fix does not preclude a more thorough solution in the future, but it
> > addresses the common case quickly.
> >
> > For existing data files we could check the writer version and ignore filters
> on
> > float/double. I don't know whether min/max filtering is common on
> > float/double, but I suspect it's not.
> >
> > On Fri, Feb 16, 2018 at 8:38 AM, Tim Armstrong <tarmstr...@cloudera.com>
> > wrote:
> >
> > > There is an extensibility mechanism with the ColumnOrder union - I
> think
> > > that was meant to avoid the need to add new stat fields?
> > >
> > > Given that the bug was in the Parquet spec, we'll need to make a spec
> > > change anyway, so we could add a new ColumnOrder -
> > FloatingPointTotalOrder?
> > > at the same time as fixing the gap in the spec.
> > >
> > > It could make sense to declare that the default ordering for
> > floats/doubles
> > > is not NaN-aware (i.e. the reader should assume that NaN was
> arbitrarily
> > > ordered) and readers should either implement the required logic to
> handle
> > > that correctly (I had some ideas here:
> > > https://issues.apache.org/jira/browse/IMPALA-6527?
> > > focusedCommentId=16366106&page=com.atlassian.jira.
> > > plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16366106)
> > > or ignore the stats.
> > >
> > > On Fri, Feb 16, 2018 at 8:15 AM, Jim Apple <jbap...@cloudera.com>
> wrote:
> > >
> > > > > We could have a similar problem
> > > > > with not finding +0.0 values because a -0.0 is written to the
> > max_value
> > > > > field by some component that considers them the same.
> > > >
> > > > My hope is that the filtering would behave sanely, since -0.0 == +0.0
> > > > under the real-number-inspired ordering, which is distinguished from
> > > > total Ordering, and which is also what you get when you use the
> > > > default C/C++ operators <, >, <=, ==, and so on.
> > > >
> > > > You can distinguish between -0.0 and +0.0 without using total
> ordering
> > > > by taking their reciprocal: 1.0/-0.0 is -inf. There are some other
> > > > ways to distinguish, I suspect, but that's the simplest one I recall
> > > > at the moment.
> > > >
> > >
> >
>


Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-16 Thread Alexander Behm
I hope the common case is that data files do not contain these special
float values. As the simplest solution, how about writers refrain from
populating the stats if a special value is encountered?

That fix does not preclude a more thorough solution in the future, but it
addresses the common case quickly.

For existing data files we could check the writer version and ignore filters on
float/double. I don't know whether min/max filtering is common on
float/double, but I suspect it's not.

On Fri, Feb 16, 2018 at 8:38 AM, Tim Armstrong 
wrote:

> There is an extensibility mechanism with the ColumnOrder union - I think
> that was meant to avoid the need to add new stat fields?
>
> Given that the bug was in the Parquet spec, we'll need to make a spec
> change anyway, so we could add a new ColumnOrder - FloatingPointTotalOrder?
> at the same time as fixing the gap in the spec.
>
> It could make sense to declare that the default ordering for floats/doubles
> is not NaN-aware (i.e. the reader should assume that NaN was arbitrarily
> ordered) and readers should either implement the required logic to handle
> that correctly (I had some ideas here:
> https://issues.apache.org/jira/browse/IMPALA-6527?
> focusedCommentId=16366106&page=com.atlassian.jira.
> plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16366106)
> or ignore the stats.
>
> On Fri, Feb 16, 2018 at 8:15 AM, Jim Apple  wrote:
>
> > > We could have a similar problem
> > > with not finding +0.0 values because a -0.0 is written to the max_value
> > > field by some component that considers them the same.
> >
> > My hope is that the filtering would behave sanely, since -0.0 == +0.0
> > under the real-number-inspired ordering, which is distinguished from
> > total Ordering, and which is also what you get when you use the
> > default C/C++ operators <, >, <=, ==, and so on.
> >
> > You can distinguish between -0.0 and +0.0 without using total ordering
> > by taking their reciprocal: 1.0/-0.0 is -inf. There are some other
> > ways to distinguish, I suspect, but that's the simplest one I recall
> > at the moment.
> >
>
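The distinctions in that exchange can be checked directly. The `total_order_key` helper below is a sketch of the IEEE-754 totalOrder predicate via bit patterns (my naming, not from the thread); note that in Python the reciprocal trick raises ZeroDivisionError, so `math.copysign` is used instead to expose the sign of zero:

```python
import math
import struct

def total_order_key(x: float) -> int:
    """Map a double to an integer whose ordering matches IEEE-754 totalOrder:
    -NaN < -inf < negatives < -0.0 < +0.0 < positives < +inf < +NaN."""
    bits = struct.unpack("<q", struct.pack("<d", x))[0]
    # Negative floats have descending bit patterns; flip them back into order.
    return bits if bits >= 0 else bits ^ 0x7FFFFFFFFFFFFFFF

# Ordinary comparison treats the zeros as equal...
assert -0.0 == 0.0
# ...but total ordering (and copysign) can tell them apart:
assert total_order_key(-0.0) < total_order_key(0.0)
assert math.copysign(1.0, -0.0) == -1.0
# The rest of the chain from the thread holds as well:
assert total_order_key(-math.inf) < total_order_key(-1.0) < total_order_key(-0.0)
assert total_order_key(0.0) < total_order_key(1.0) < total_order_key(math.inf)
```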


Re: Please Add Me as an Impala Contributor

2018-02-13 Thread Alexander Behm
Done

On Tue, Feb 13, 2018 at 9:54 AM, Fredy Wijaya  wrote:

> My JIRA username is fredyw.
>
> *Fredy Wijaya* | Software Engineer
> e. fwij...@cloudera.com
> cloudera.com 
>
> [image: Cloudera] 
>
> [image: Cloudera on Twitter]  [image:
> Cloudera on Facebook]  [image: Cloudera
> on LinkedIn] 
> --
>


Re: PHJ node assignment

2018-02-12 Thread Alexander Behm
Jeszy, the way I read your question is: How much inter-node parallelism is
good?

As usual with perf questions, the answer is "it depends". Involving all nodes
in the cluster for a PHJ may not work well. Intuitively, each node should
have a minimum amount of work for the cost of shipping fragments there to
be worth it. So ultimately, you need to estimate how much work each node is
going to do - which is hard (planner estimates). Involving too many nodes
can make the query slower (query startup) and less efficient (work vs.
query startup). Further, involving more nodes can exacerbate the thrift
connection issues we're aware of.

The benefit of the current policy is that it is simple and does not rely
much on planner estimates. Simple is typically more robust than fancy
because similar queries tend to have similar plans (and degrees of
parallelism).

Feel free to file a JIRA if you think this should be addressed. However, I
don't think the right policy is "always run on all nodes".




On Mon, Feb 12, 2018 at 4:18 AM, Jeszy  wrote:

> Thanks for the response, Quanlong. The behaviour you describe is broadcast
> join (versus partitioned / shuffle) - sorry for confusing usage of terms!
> Take a look at the differences in the cost model for the two (in lieu of
> better description):
> https://github.com/apache/impala/blob/master/fe/src/
> main/java/org/apache/impala/planner/DistributedPlanner.java#L444-L503
>
> A partial summary for a shuffle join would be:
> Operator  #Hosts
> 
> 05:EXCHANGE1
> 02:HASH JOIN   2
> |--04:EXCHANGE 2
> |  01:SCAN HDFS2
> 03:EXCHANGE2
> 00:SCAN HDFS   2
>
> Notice Exchanges on both sides.
>
> On 12 February 2018 at 12:51, Quanlong Huang 
> wrote:
>
> > IMU, the left side is always located with the hash join node. If the
> stats
> > are correct, the left side will always be a larger table/input. There're
> > two terminologies in the hash join algorithm: build and probe. The
> smaller
> > table that can be built into an in-memory hash table is called the
> "build"
> > input. It's represented at the right side. After the in-memory hash table
> > is built, the larger table will be scanned and rows will be probed in the
> > hash table to find matched results. The larger table is called the
> "probe"
> > input and represented at the left side. So not all rows are sent across
> the
> > network to perform a hash join. Usually the larger table is scanned
> > locally. Network traffic comes from the "build" input. It's smaller and
> > sometimes can even be represented as a BloomFilter (one kind of
> > RuntimeFilter in Impala).
> >
> > However, there's still one case that all rows are sent across the network
> > anyway. That is when all tables are not located in the Impala cluster
> (e.g.
> > Impala is deployed in a portion of the Hadoop cluster). Scanning the
> tables
> > both consumes network traffic. However, when performing hash join, the
> > results of the right side will be sent to the left side, since they have
> > smaller size and consumes less network traffic than sending the left
> side.
> >
> >
> > I find this paper in "Impala Reading List" has much more details and
> > deserves to be read more times:
> > Hash joins and hash teams in Microsoft SQL Server (Graefe, Bunker,
> Cooper)
> >
> >
> > HTH
> >
> >
> > At 2018-02-12 18:13:09, "Jeszy"  wrote:
> > >IIUC, every row scanned in a partitioned hash join (both sides) is sent
> > >across the network (an exchange on HASH(key)). The targets of this
> > exchange
> > >are nodes that have data locality with the left side of the join. Why
> does
> > >Impala do it that way?
> > >
> > >Since all rows are sent across the network anyway, Impala could just use
> > >all the nodes in the cluster. The upside would be better parallelism for
> > >the join itself as well as for all the operators sitting on top of it.
> Is
> > >there a downside I'm forgetting?
> > >If not, is there a jira tracking this already? Haven't found one.
> > >
> > >Thanks!
> >
>


Re: Freezing jenkins.impala.io Wed 2018-02-07 (today) at 6pm PST

2018-02-07 Thread Alexander Behm
Jenkins and a few plugins were updated. The service is back now at
jenkins.impala.io

Warning: I updated the "Pipeline: Supporting APIs" plugin to version 2.16
due to a severe security vulnerability. The new version's release notes
mention incompatible changes, so please let me know if you see something
strange.

On Wed, Feb 7, 2018 at 2:53 PM, Alexander Behm <alex.b...@cloudera.com>
wrote:

> Jenkins will be upgraded today.
>
> Timeline:
> 6pm Wed 2018-02-07: Jenkins will stop accepting new jobs
> 10pm Wed 2018-02-07: Jenkins will restart
>
> If all goes well Jenkins should be back online the morning of Wed
> 2018-02-08 (PST)
>
>
>


Freezing jenkins.impala.io Wed 2018-02-07 (today) at 6pm PST

2018-02-07 Thread Alexander Behm
Jenkins will be upgraded today.

Timeline:
6pm Wed 2018-02-07: Jenkins will stop accepting new jobs
10pm Wed 2018-02-07: Jenkins will restart

If all goes well Jenkins should be back online the morning of Wed
2018-02-08 (PST)


Build broken: gerrit-verify-dryrun will likely crash

2018-02-07 Thread Alexander Behm
These two issues are causing frequent but non-deterministic crashes in
Impala build+test runs (aka gerrit-verify-dryrun). Please refrain from
running gerrit-verify-dryrun until the following issues are resolved.

https://issues.apache.org/jira/projects/IMPALA/issues/IMPALA-6488
https://issues.apache.org/jira/projects/IMPALA/issues/IMPALA-6484

We suspect the following commits could be offenders:
a018038df5b13f24f7980b75d755e0123ae2687d
4aafa5e9ba9fe22d2dbc7764a796b3cd04136cc0


Re: What kind of sql query can generate multiple children for EXCHANGE node?

2018-02-05 Thread Alexander Behm
Yes, we should. That comment is an ancient remnant of how we used to
execute union plans. Today, that comment is simply wrong.

On Mon, Feb 5, 2018 at 3:27 PM, Xinran Yu Tinney <yuxinran8...@gmail.com>
wrote:

> Thanks Alex, should we clarify this in the comment of ExchangeNode.java
> "Typically, an ExchangeNode only has a single sender child but,
>  e.g., for distributed union queries an ExchangeNode may have one sender
> child per
>  union operand."
>
> 2018-02-05 11:53 GMT-06:00 Alexander Behm <alex.b...@cloudera.com>:
>
> > An exchange node can only have one child
> >
> > On Mon, Feb 5, 2018 at 9:52 AM, Xinran Yu Tinney <yuxinran8...@gmail.com
> >
> > wrote:
> >
> > > Hi, Impala dev,
> > >I am working on Jira IMPALA-5440
> > > <https://issues.apache.org/jira/browse/IMPALA-5440> and one of the
> > problem
> > > is to test that Impala will handle cardinality overflow when there are
> > > multiple children under EXCHANGE node. I searched among the examples of
> > > .test file and could not find one. I was wondering if anyone has
> > experience
> > > running such queries? Thanks!
> > >
> > >
> > > Xinran
> > >
> >
>


Re: Reserving standard SQL keywords next Impala release (IMPALA-3916)

2017-12-12 Thread Alexander Behm
I meant doing it in a point release.

On Tue, Dec 12, 2017 at 11:02 AM, Dimitris Tsirogiannis <
dtsirogian...@cloudera.com> wrote:

> I think this is a good idea. Maybe we should do it in the next major
> release (v3) instead of a point release, unless that's what you meant.
>
> Dimitris
>
> On Tue, Dec 12, 2017 at 10:57 AM, Alexander Behm <alex.b...@cloudera.com>
> wrote:
>
>> Reserving standard SQL keywords seems like a reasonable thing to do, but
>> it
>> is an incompatible change. I think it should be ok to include the change
>> in
>> the next Impala release (whatever comes after 2.11), but wanted to hear
>> other opinions.
>>
>> See:
>> https://issues.apache.org/jira/browse/IMPALA-3916
>>
>
>


Re: [DISCUSS] 2.11.0 release

2017-12-07 Thread Alexander Behm
It would also be great to include IMPALA-6286, which is a wrong-results bug
with runtime filters.

On Thu, Dec 7, 2017 at 10:50 AM, Thomas Tauber-Marshall <
tmarsh...@cloudera.com> wrote:

> On Thu, Dec 7, 2017 at 10:37 AM Jim Apple  wrote:
>
> > I think it would be great to get a fix for
> > https://issues.apache.org/jira/browse/IMPALA-6285 in 2.11.0 if possible.
> > It
> > apparently could create a large performance boost.
> >
> > It's marked as a Blocker with affects version 2.11, but no target
> version.
> > There are a few other tickets like this:
> >
> >
> > https://issues.apache.org/jira/browse/IMPALA-3887?jql=
> project%20%3D%20IMPALA%20AND%20affectedVersion%20%3D%20%
> 22Impala%202.11.0%22%20AND%20%22Target%20Version%22%20%3D%
> 20EMPTY%20and%20resolution%20%3D%20EMPTY%20and%20Priority%
> 20%3D%20Blocker%20ORDER%20BY%20priority%20DESC
>
>
> Good point. Of the four JIRAs shown there, two are flaky tests that I don't
> think are really blockers (IMPALA-6257
>  and IMPALA-3887
> ) and the other two (
> IMPALA-6285  and
> IMPALA-6081 ) have
> reviews out that have already been +2ed. So, it seems like a good idea to
> wait for those two to go in.
>
>
> >
> >
> > On Thu, Dec 7, 2017 at 9:33 AM, Thomas Tauber-Marshall <
> > tmarsh...@cloudera.com> wrote:
> >
> > > Since the response from the community has been good, and now that all
> of
> > > the blocker JIRAs targeted for 2.11 have been closed, I propose that we
> > cut
> > > the release at:
> > >
> > > commit a4916e6d5f5f3542100af791534bfaf9ed544720
> > > Author: Michael Ho 
> > > Date:   Tue Dec 5 23:01:00 2017 -0800
> > >
> > > IMPALA-6281: Fix use-after-free in InitAuth()
> > >
> > > There are still a lot of open JIRAs
> > >  > > project%20%3D%20IMPALA%20AND%20status%20in%20(Open%2C%20%
> > >
> > 22In%20Progress%22%2C%20Reopened)%20AND%20%22Target%
> 20Version%22%20%3D%20%
> > > 22Impala%202.11.0%22>
> > > targeted at 2.11 at lower priorities, so it would be helpful if people
> > > could go through the ones assigned to them and make sure nothing
> > important
> > > is being missed, otherwise we'll bulk update all of these to target
> 2.12
> > >
> > > If there are no further concerns, I'll start testing at that commit,
> and
> > if
> > > all goes well create a release candidate and [VOTE] thread.
> > >
> > > On Thu, Nov 30, 2017 at 2:12 PM Matthew Jacobs 
> > wrote:
> > >
> > > > +1
> > > >
> > > > Thanks, Thomas!
> > > >
> > > > On Thu, Nov 30, 2017 at 1:50 PM Michael Brown 
> > > wrote:
> > > >
> > > > > +1
> > > > >
> > > > > On Thu, Nov 30, 2017 at 12:46 PM, Thomas Tauber-Marshall <
> > > > > tmarsh...@cloudera.com> wrote:
> > > > >
> > > > > > Folks,
> > > > > >
> > > > > > It has been over 2 months since we released Apache Impala 2.10.0
> > and
> > > > > there
> > > > > > have been new feature improvements and a good number of bug fixes
> > > > checked
> > > > > > in since then.
> > > > > >
> > > > > > I propose that we release 2.11.0 soon and I volunteer to be its
> > > release
> > > > > > manager. Please speak up and let the community know if anyone has
> > any
> > > > > > objections to this.
> > > > > >
> > > > > > Thanks,
> > > > > > Thomas
> > > > > >
> > > > >
> > > > --
> > > > Sent from My iPhone
> > > >
> > >
> >
>


Re: New Impala contributors: getting started

2017-11-29 Thread Alexander Behm
Need to change your remote with "git remote set-url <name> <url>"

On Wed, Nov 29, 2017 at 5:34 PM, John Russell  wrote:

> > git clone https://git-wip-us.apache.org/repos/asf/incubator-impala.git
> ~/Impala
>
> Today, doing a pull on my already-checked-out master branch, I get:
>
> fatal: repository 'https://git-wip-us.apache.org/repos/asf/incubator-
> impala.git/' not found
>
> What's the git idiom to make an existing cloned repo not think it's based
> off incubator-impala.git?  I presume the 'incubator' part of the name went
> away upon graduation.
>
> Thanks,
> John
>
> > On Sep 3, 2017, at 7:29 PM, Jim Apple  wrote:
> >
> > If you are new to Impala and would like to contribute, you can start
> > by setting up an Impala development environment. For this you'll need
> > an Ubuntu 14.04 or 16.04 machine. Then just:
> >
> > git clone https://git-wip-us.apache.org/repos/asf/incubator-impala.git
> ~/Impala
> > source ~/Impala/bin/bootstrap_development.sh
> >
> > This will take about two hours to run, but when it is done you will be
> > ready to start developing Impala!
> >
> > If you are then ready to start developing, take a look at Impala's
> > newbie issues: https://issues.apache.org/jira/issues/?filter=12341668.
> > If you find one you like, feel free to email dev@impala.apache.org to
> > discuss it, or dig right in. Before you start, though, register on the
> > Apache JIRA system and ask someone on dev@ to assign the ticket to
> > you. That way you don't end up in a race condition with another new
> > contributor! :-D
> >
> > More detailed instructions on Impala's contribution process are
> > available on the wiki:
> > https://cwiki.apache.org/confluence/display/IMPALA/
> Contributing+to+Impala
> >
> > If you don't have an Ubuntu 14.04 or 16.04 environment available, you
> > can use Docker. First, install Docker as you normally would. Then,
> >
> > docker pull ubuntu:16.04
> > docker run --privileged --interactive --tty --name impala-dev
> ubuntu:16.04 bash
> >
> > Now, within the container:
> >
> > apt-get update
> > apt-get install sudo
> > adduser --disabled-password --gecos '' impdev
> > echo 'impdev ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers
> > su - impdev
> >
> > Then, as impdev in the container:
> >
> > sudo apt-get --yes install git
> > git clone https://git-wip-us.apache.org/repos/asf/incubator-impala.git
> ~/Impala
> > source ~/Impala/bin/bootstrap_development.sh
> >
> > When that's done, start developing! When you're ready to pause, in a
> > new terminal in the host:
> >
> > docker commit impala-dev && docker stop impala-dev
> >
> > When you're ready to get back to work:
> >
> > docker start --interactive impala-dev
> >
> > If instead of committing your work and stopping the container, you
> > just want to detach from it, use ctrl-p ctrl-q. You can re-attach
> > using the start command.
>
>