Re: performance issue on big table join

2017-11-02 Thread
Thanks, Alex, for replying again.

Is there a plan to support multi-threaded join/aggregation? Or is it
intended to stay single-threaded in order to maximize query throughput?



2017-11-03 0:32 GMT+08:00 Alexander Behm <alex.b...@cloudera.com>:

> See my response on the other thread you started. The probe side of a join
> is executed in a single thread per host. Impala can run multiple
> builds in parallel, but each build uses only a single thread.
> A single query might not be able to max out your CPU, but most realistic
> workloads run several queries concurrently.
>
> On Thu, Nov 2, 2017 at 12:22 AM, Hongxu Ma <inte...@outlook.com> wrote:
>
> > Thanks LL. Your query options look good.
> >
> > As Xu Cheng mentioned, I also noticed that Impala does hash joins slowly
> > in some big-data situations.
> > I am very curious about the root cause.
> >
> >
> > On 02/11/2017 10:00, 俊杰陈 wrote:
> >
> > +user list
> >
> > 2017-11-02 9:57 GMT+08:00 俊杰陈 <cjjnj...@gmail.com>:
> >
> >
> > Hi Mostafa
> >
> > Cheng already put the profile in thread.
> >
> > Here is another profile, taken on an Impala release version. You can also
> > find it in the attachment.
> >
> >
> > 2017-11-02 9:30 GMT+08:00 Mostafa Mokhtar <mmokh...@cloudera.com>:
> >
> >
> > Attaching the query profile will be most helpful to investigate this
> > issue.
> >
> > If you can capture the profile from the WebUI on the coordinator node it
> > would be great.
> >
> > On Wed, Nov 1, 2017 at 6:22 PM, 俊杰陈 <cjjnj...@gmail.com> wrote:
> >
> >
> > Thanks Hongxu,
> >
> > Here are the configurations on my cluster; most of them are default values.
> > Which item do you think may have an impact?
> >
> > ABORT_ON_DEFAULT_LIMIT_EXCEEDED: [0]
> > ABORT_ON_ERROR: [0]
> > ALLOW_UNSUPPORTED_FORMATS: [0]
> > APPX_COUNT_DISTINCT: [0]
> > BATCH_SIZE: [0]
> > COMPRESSION_CODEC: [NONE]
> > DEBUG_ACTION: []
> > DEFAULT_ORDER_BY_LIMIT: [-1]
> > DISABLE_CACHED_READS: [0]
> > DISABLE_CODEGEN: [0]
> > DISABLE_OUTERMOST_TOPN: [0]
> > DISABLE_ROW_RUNTIME_FILTERING: [0]
> > DISABLE_STREAMING_PREAGGREGATIONS: [0]
> > DISABLE_UNSAFE_SPILLS: [0]
> > ENABLE_EXPR_REWRITES: [1]
> > EXEC_SINGLE_NODE_ROWS_THRESHOLD: [100]
> > EXPLAIN_LEVEL: [1]
> > HBASE_CACHE_BLOCKS: [0]
> > HBASE_CACHING: [0]
> > MAX_BLOCK_MGR_MEMORY: [0]
> > MAX_ERRORS: [100]
> > MAX_IO_BUFFERS: [0]
> > MAX_NUM_RUNTIME_FILTERS: [10]
> > MAX_SCAN_RANGE_LENGTH: [0]
> > MEM_LIMIT: [0]
> > MT_DOP: [0]
> > NUM_NODES: [0]
> > NUM_SCANNER_THREADS: [0]
> > OPTIMIZE_PARTITION_KEY_SCANS: [0]
> > PARQUET_ANNOTATE_STRINGS_UTF8: [0]
> > PARQUET_FALLBACK_SCHEMA_RESOLUTION: [0]
> > PARQUET_FILE_SIZE: [0]
> > PREFETCH_MODE: [1]
> > QUERY_TIMEOUT_S: [0]
> > REPLICA_PREFERENCE: [0]
> > REQUEST_POOL: []
> > RESERVATION_REQUEST_TIMEOUT: [0]
> > RM_INITIAL_MEM: [0]
> > RUNTIME_BLOOM_FILTER_SIZE: [1048576]
> > RUNTIME_FILTER_MAX_SIZE: [16777216]
> > RUNTIME_FILTER_MIN_SIZE: [1048576]
> > RUNTIME_FILTER_MODE: [2]
> > RUNTIME_FILTER_WAIT_TIME_MS: [0]
> > S3_SKIP_INSERT_STAGING: [1]
> > SCAN_NODE_CODEGEN_THRESHOLD: [180]
> > SCHEDULE_RANDOM_REPLICA: [0]
> > SCRATCH_LIMIT: [-1]
> > SEQ_COMPRESSION_MODE: [0]
> > STRICT_MODE: [0]
> > SUPPORT_START_OVER: [false]
> > SYNC_DDL: [0]
> > V_CPU_CORES: [0]
> >
> > 2017-10-31 15:30 GMT+08:00 Hongxu Ma <inte...@outlook.com>:
> >
> >
> > Hi JJ
> > Considering it only takes 3 minutes on SparkSQL, maybe there are some
> > mistakes in the query options.
> > Try running "set;" in impala-shell and check all query options, e.g.:
> > BATCH_SIZE: [0]
> > DISABLE_CODEGEN: [0]
> > RUNTIME_FILTER_MODE: GLOBAL
> >
> > Just a guess, thanks.
> >
> > On 27/10/2017 10:25, 俊杰陈 wrote:
> > The profile file is damaged. Here is a screenshot of the exec summary
> > [cid:ii_j999ymep1_15f5ba563aeabb

Re: performance issue on big table join

2017-11-01 Thread
+user list

2017-11-02 9:57 GMT+08:00 俊杰陈 <cjjnj...@gmail.com>:

> Hi Mostafa
>
> Cheng already put the profile in thread.
>
> Here is another profile, taken on an Impala release version. You can also
> find it in the attachment.
>
>
> 2017-11-02 9:30 GMT+08:00 Mostafa Mokhtar <mmokh...@cloudera.com>:
>
>> Attaching the query profile will be most helpful to investigate this
>> issue.
>>
>> If you can capture the profile from the WebUI on the coordinator node it
>> would be great.
>>
>> On Wed, Nov 1, 2017 at 6:22 PM, 俊杰陈 <cjjnj...@gmail.com> wrote:
>>
>> > Thanks Hongxu,
>> >
>> > Here are the configurations on my cluster; most of them are default values.
>> > Which item do you think may have an impact?
>> >
>> > ABORT_ON_DEFAULT_LIMIT_EXCEEDED: [0]
>> > ABORT_ON_ERROR: [0]
>> > ALLOW_UNSUPPORTED_FORMATS: [0]
>> > APPX_COUNT_DISTINCT: [0]
>> > BATCH_SIZE: [0]
>> > COMPRESSION_CODEC: [NONE]
>> > DEBUG_ACTION: []
>> > DEFAULT_ORDER_BY_LIMIT: [-1]
>> > DISABLE_CACHED_READS: [0]
>> > DISABLE_CODEGEN: [0]
>> > DISABLE_OUTERMOST_TOPN: [0]
>> > DISABLE_ROW_RUNTIME_FILTERING: [0]
>> > DISABLE_STREAMING_PREAGGREGATIONS: [0]
>> > DISABLE_UNSAFE_SPILLS: [0]
>> > ENABLE_EXPR_REWRITES: [1]
>> > EXEC_SINGLE_NODE_ROWS_THRESHOLD: [100]
>> > EXPLAIN_LEVEL: [1]
>> > HBASE_CACHE_BLOCKS: [0]
>> > HBASE_CACHING: [0]
>> > MAX_BLOCK_MGR_MEMORY: [0]
>> > MAX_ERRORS: [100]
>> > MAX_IO_BUFFERS: [0]
>> > MAX_NUM_RUNTIME_FILTERS: [10]
>> > MAX_SCAN_RANGE_LENGTH: [0]
>> > MEM_LIMIT: [0]
>> > MT_DOP: [0]
>> > NUM_NODES: [0]
>> > NUM_SCANNER_THREADS: [0]
>> > OPTIMIZE_PARTITION_KEY_SCANS: [0]
>> > PARQUET_ANNOTATE_STRINGS_UTF8: [0]
>> > PARQUET_FALLBACK_SCHEMA_RESOLUTION: [0]
>> > PARQUET_FILE_SIZE: [0]
>> > PREFETCH_MODE: [1]
>> > QUERY_TIMEOUT_S: [0]
>> > REPLICA_PREFERENCE: [0]
>> > REQUEST_POOL: []
>> > RESERVATION_REQUEST_TIMEOUT: [0]
>> > RM_INITIAL_MEM: [0]
>> > RUNTIME_BLOOM_FILTER_SIZE: [1048576]
>> > RUNTIME_FILTER_MAX_SIZE: [16777216]
>> > RUNTIME_FILTER_MIN_SIZE: [1048576]
>> > RUNTIME_FILTER_MODE: [2]
>> > RUNTIME_FILTER_WAIT_TIME_MS: [0]
>> > S3_SKIP_INSERT_STAGING: [1]
>> > SCAN_NODE_CODEGEN_THRESHOLD: [180]
>> > SCHEDULE_RANDOM_REPLICA: [0]
>> > SCRATCH_LIMIT: [-1]
>> > SEQ_COMPRESSION_MODE: [0]
>> > STRICT_MODE: [0]
>> > SUPPORT_START_OVER: [false]
>> > SYNC_DDL: [0]
>> > V_CPU_CORES: [0]
>> >
>> > 2017-10-31 15:30 GMT+08:00 Hongxu Ma <inte...@outlook.com>:
>> >
>> > > Hi JJ
>> > > Considering it only takes 3 minutes on SparkSQL, maybe there are some
>> > > mistakes in the query options.
>> > > Try running "set;" in impala-shell and check all query options, e.g.:
>> > > BATCH_SIZE: [0]
>> > > DISABLE_CODEGEN: [0]
>> > > RUNTIME_FILTER_MODE: GLOBAL
>> > >
>> > > Just a guess, thanks.
>> > >
>> > > On 27/10/2017 10:25, 俊杰陈 wrote:
>> > > The profile file is damaged. Here is a screenshot of the exec summary
>> > > [cid:ii_j999ymep1_15f5ba563aeabb91]
>> > > ​
>> > >
>> > > 2017-10-27 10:04 GMT+08:00 俊杰陈 <cjjnj...@gmail.com>:
>> > > Hi Devs
>> > >
>> > > I hit a performance issue on a big table join. The query takes more
>> > > than 3 hours on Impala and only 3 minutes on Spark SQL on the same
>> > > 5-node cluster. While the query is running, the left scanner and
>> > > exchange node are very slow. Did I miss some key arguments?
>> > >
>> > > You can see the profile file in the attachment.
>> > >
>> > > [cid:ii_j9998pph2_15f5b92f2cf47020]
>> > > ​
>> > > --
>> > > Thanks & Best Regards
>> > >
>> > >
>> > >
>> > > --
>> > > Thanks & Best Regards
>> > >
>> > >
>> > > --
>> > > Regards,
>> > > Hongxu.
>> > >
>> >
>> >
>> >
>> > --
>> > Thanks & Best Regards
>> >
>>
>
>
>
> --
> Thanks & Best Regards
>



-- 
Thanks & Best Regards


Re: performance issue on big table join

2017-11-01 Thread
Hi Mostafa

Cheng already put the profile in thread.

Here is another profile, taken on an Impala release version. You can also
find it in the attachment.


2017-11-02 9:30 GMT+08:00 Mostafa Mokhtar <mmokh...@cloudera.com>:

> Attaching the query profile will be most helpful to investigate this issue.
>
> If you can capture the profile from the WebUI on the coordinator node it
> would be great.
>
> On Wed, Nov 1, 2017 at 6:22 PM, 俊杰陈 <cjjnj...@gmail.com> wrote:
>
> > Thanks Hongxu,
> >
> > Here are the configurations on my cluster; most of them are default values.
> > Which item do you think may have an impact?
> >
> > ABORT_ON_DEFAULT_LIMIT_EXCEEDED: [0]
> > ABORT_ON_ERROR: [0]
> > ALLOW_UNSUPPORTED_FORMATS: [0]
> > APPX_COUNT_DISTINCT: [0]
> > BATCH_SIZE: [0]
> > COMPRESSION_CODEC: [NONE]
> > DEBUG_ACTION: []
> > DEFAULT_ORDER_BY_LIMIT: [-1]
> > DISABLE_CACHED_READS: [0]
> > DISABLE_CODEGEN: [0]
> > DISABLE_OUTERMOST_TOPN: [0]
> > DISABLE_ROW_RUNTIME_FILTERING: [0]
> > DISABLE_STREAMING_PREAGGREGATIONS: [0]
> > DISABLE_UNSAFE_SPILLS: [0]
> > ENABLE_EXPR_REWRITES: [1]
> > EXEC_SINGLE_NODE_ROWS_THRESHOLD: [100]
> > EXPLAIN_LEVEL: [1]
> > HBASE_CACHE_BLOCKS: [0]
> > HBASE_CACHING: [0]
> > MAX_BLOCK_MGR_MEMORY: [0]
> > MAX_ERRORS: [100]
> > MAX_IO_BUFFERS: [0]
> > MAX_NUM_RUNTIME_FILTERS: [10]
> > MAX_SCAN_RANGE_LENGTH: [0]
> > MEM_LIMIT: [0]
> > MT_DOP: [0]
> > NUM_NODES: [0]
> > NUM_SCANNER_THREADS: [0]
> > OPTIMIZE_PARTITION_KEY_SCANS: [0]
> > PARQUET_ANNOTATE_STRINGS_UTF8: [0]
> > PARQUET_FALLBACK_SCHEMA_RESOLUTION: [0]
> > PARQUET_FILE_SIZE: [0]
> > PREFETCH_MODE: [1]
> > QUERY_TIMEOUT_S: [0]
> > REPLICA_PREFERENCE: [0]
> > REQUEST_POOL: []
> > RESERVATION_REQUEST_TIMEOUT: [0]
> > RM_INITIAL_MEM: [0]
> > RUNTIME_BLOOM_FILTER_SIZE: [1048576]
> > RUNTIME_FILTER_MAX_SIZE: [16777216]
> > RUNTIME_FILTER_MIN_SIZE: [1048576]
> > RUNTIME_FILTER_MODE: [2]
> > RUNTIME_FILTER_WAIT_TIME_MS: [0]
> > S3_SKIP_INSERT_STAGING: [1]
> > SCAN_NODE_CODEGEN_THRESHOLD: [180]
> > SCHEDULE_RANDOM_REPLICA: [0]
> > SCRATCH_LIMIT: [-1]
> > SEQ_COMPRESSION_MODE: [0]
> > STRICT_MODE: [0]
> > SUPPORT_START_OVER: [false]
> > SYNC_DDL: [0]
> > V_CPU_CORES: [0]
> >
> > 2017-10-31 15:30 GMT+08:00 Hongxu Ma <inte...@outlook.com>:
> >
> > > Hi JJ
> > > Considering it only takes 3 minutes on SparkSQL, maybe there are some
> > > mistakes in the query options.
> > > Try running "set;" in impala-shell and check all query options, e.g.:
> > > BATCH_SIZE: [0]
> > > DISABLE_CODEGEN: [0]
> > > RUNTIME_FILTER_MODE: GLOBAL
> > >
> > > Just a guess, thanks.
> > >
> > > On 27/10/2017 10:25, 俊杰陈 wrote:
> > > The profile file is damaged. Here is a screenshot of the exec summary
> > > [cid:ii_j999ymep1_15f5ba563aeabb91]
> > > ​
> > >
> > > 2017-10-27 10:04 GMT+08:00 俊杰陈 <cjjnj...@gmail.com>:
> > > Hi Devs
> > >
> > > I hit a performance issue on a big table join. The query takes more
> > > than 3 hours on Impala and only 3 minutes on Spark SQL on the same
> > > 5-node cluster. While the query is running, the left scanner and
> > > exchange node are very slow. Did I miss some key arguments?
> > >
> > > You can see the profile file in the attachment.
> > >
> > > [cid:ii_j9998pph2_15f5b92f2cf47020]
> > > ​
> > > --
> > > Thanks & Best Regards
> > >
> > >
> > >
> > > --
> > > Thanks & Best Regards
> > >
> > >
> > > --
> > > Regards,
> > > Hongxu.
> > >
> >
> >
> >
> > --
> > Thanks & Best Regards
> >
>



-- 
Thanks & Best Regards
Query (id=db497c13276e70de:38c671cf):
  Summary:
Session ID: ee4f844616a8170f:5cf00c4759ee93af
Session Type: BEESWAX
Start Time: 2017-11-01 15:57:09.150268000
End Time: 2017-11-01 18:56:53.614915000
Query Type: QUERY
Query State: FINISHED
Query Status

Re: performance issue on big table join

2017-11-01 Thread
Thanks Hongxu,

Here are the configurations on my cluster; most of them are default values.
Which item do you think may have an impact?

ABORT_ON_DEFAULT_LIMIT_EXCEEDED: [0]
ABORT_ON_ERROR: [0]
ALLOW_UNSUPPORTED_FORMATS: [0]
APPX_COUNT_DISTINCT: [0]
BATCH_SIZE: [0]
COMPRESSION_CODEC: [NONE]
DEBUG_ACTION: []
DEFAULT_ORDER_BY_LIMIT: [-1]
DISABLE_CACHED_READS: [0]
DISABLE_CODEGEN: [0]
DISABLE_OUTERMOST_TOPN: [0]
DISABLE_ROW_RUNTIME_FILTERING: [0]
DISABLE_STREAMING_PREAGGREGATIONS: [0]
DISABLE_UNSAFE_SPILLS: [0]
ENABLE_EXPR_REWRITES: [1]
EXEC_SINGLE_NODE_ROWS_THRESHOLD: [100]
EXPLAIN_LEVEL: [1]
HBASE_CACHE_BLOCKS: [0]
HBASE_CACHING: [0]
MAX_BLOCK_MGR_MEMORY: [0]
MAX_ERRORS: [100]
MAX_IO_BUFFERS: [0]
MAX_NUM_RUNTIME_FILTERS: [10]
MAX_SCAN_RANGE_LENGTH: [0]
MEM_LIMIT: [0]
MT_DOP: [0]
NUM_NODES: [0]
NUM_SCANNER_THREADS: [0]
OPTIMIZE_PARTITION_KEY_SCANS: [0]
PARQUET_ANNOTATE_STRINGS_UTF8: [0]
PARQUET_FALLBACK_SCHEMA_RESOLUTION: [0]
PARQUET_FILE_SIZE: [0]
PREFETCH_MODE: [1]
QUERY_TIMEOUT_S: [0]
REPLICA_PREFERENCE: [0]
REQUEST_POOL: []
RESERVATION_REQUEST_TIMEOUT: [0]
RM_INITIAL_MEM: [0]
RUNTIME_BLOOM_FILTER_SIZE: [1048576]
RUNTIME_FILTER_MAX_SIZE: [16777216]
RUNTIME_FILTER_MIN_SIZE: [1048576]
RUNTIME_FILTER_MODE: [2]
RUNTIME_FILTER_WAIT_TIME_MS: [0]
S3_SKIP_INSERT_STAGING: [1]
SCAN_NODE_CODEGEN_THRESHOLD: [180]
SCHEDULE_RANDOM_REPLICA: [0]
SCRATCH_LIMIT: [-1]
SEQ_COMPRESSION_MODE: [0]
STRICT_MODE: [0]
SUPPORT_START_OVER: [false]
SYNC_DDL: [0]
V_CPU_CORES: [0]
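
To compare an option dump like the one above against a dump from another session or cluster, a small script may help. This is only a sketch: it assumes the "NAME: [value]" line format shown in this message, the second dump in the demo is made up, and the helper names parse_options and diff_options are invented here (they are not part of impala-shell).

```python
import re

# Matches lines such as "BATCH_SIZE: [0]" or "DEBUG_ACTION: []".
# Assumption: the "NAME: [value]" layout of the "set;" output pasted in this thread.
OPTION_LINE = re.compile(r"^\s*([A-Z0-9_]+):\s*\[(.*)\]\s*$")


def parse_options(text):
    """Return a dict of option name -> value string from a "set;" dump."""
    options = {}
    for line in text.splitlines():
        match = OPTION_LINE.match(line)
        if match:
            options[match.group(1)] = match.group(2)
    return options


def diff_options(left, right):
    """Return {name: (left_value, right_value)} for options that differ."""
    names = set(left) | set(right)
    return {
        name: (left.get(name), right.get(name))
        for name in sorted(names)
        if left.get(name) != right.get(name)
    }


if __name__ == "__main__":
    mine = parse_options("BATCH_SIZE: [0]\nRUNTIME_FILTER_MODE: [2]\nMT_DOP: [0]")
    # Hypothetical second dump, only to show what a difference looks like.
    other = parse_options("BATCH_SIZE: [0]\nRUNTIME_FILTER_MODE: [0]\nMT_DOP: [0]")
    print(diff_options(mine, other))
```

Lines that do not follow the bracketed layout are silently skipped, so pasted mail text around the dump does no harm.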

2017-10-31 15:30 GMT+08:00 Hongxu Ma <inte...@outlook.com>:

> Hi JJ
> Considering it only takes 3 minutes on SparkSQL, maybe there are some
> mistakes in the query options.
> Try running "set;" in impala-shell and check all query options, e.g.:
> BATCH_SIZE: [0]
> DISABLE_CODEGEN: [0]
> RUNTIME_FILTER_MODE: GLOBAL
>
> Just a guess, thanks.
>
> On 27/10/2017 10:25, 俊杰陈 wrote:
> The profile file is damaged. Here is a screenshot of the exec summary
> [cid:ii_j999ymep1_15f5ba563aeabb91]
> ​
>
> 2017-10-27 10:04 GMT+08:00 俊杰陈 <cjjnj...@gmail.com>:
> Hi Devs
>
> I hit a performance issue on a big table join. The query takes more than 3
> hours on Impala and only 3 minutes on Spark SQL on the same 5-node
> cluster. While the query is running, the left scanner and exchange node are
> very slow. Did I miss some key arguments?
>
> You can see the profile file in the attachment.
>
> [cid:ii_j9998pph2_15f5b92f2cf47020]
> ​
> --
> Thanks & Best Regards
>
>
>
> --
> Thanks & Best Regards
>
>
> --
> Regards,
> Hongxu.
>



-- 
Thanks & Best Regards


Re: How many threads impala start for handling partitioned join?

2017-10-27 Thread
Thanks, Alex!

My question is whether Impala can start multiple fragment instances for a
particular plan fragment on a single node. For example, I have 5 fragment
instances of a plan fragment, say F01, on a 5-node cluster. Is it
possible to have 10 F01 instances on 5 nodes, i.e. 2 F01 instances per node?

2017-10-27 13:41 GMT+08:00 Alexander Behm <alex.b...@cloudera.com>:

> The multithreading effort is still ongoing. Joins, in particular, are not
> executed with multiple threads yet.
>
> Not sure if I completely followed your last two questions, please correct
> me if I misunderstood.
> The general idea of the multithreading effort is to start multiple fragment
> instances per host. A fragment instance may contain an exchange node.
>
>
> On Wed, Oct 25, 2017 at 7:22 PM, 俊杰陈 <cjjnj...@gmail.com> wrote:
>
> > Thanks for the reply.
> >
> > I saw that IMPALA-3902 <https://issues.apache.org/jira/browse/IMPALA-3902>
> > seems to add support for multithreaded execution. It describes the goal as
> > supporting multiple fragment instances on a single node. Does that mean the
> > coordinator generates multiple instances of a plan fragment on a single
> > node, so that multiple exchange nodes are started to receive and process
> > data? Or does it start instances of different plan fragments to prepare
> > the streaming?
> >
> > 2017-10-25 22:08 GMT+08:00 Jeszy <jes...@gmail.com>:
> >
> > > Hello JJ,
> > >
> > > No, currently Impala uses one thread to execute the join (without
> > > regard for the number of partitions that fit into memory).
> > >
> > > HTH
> > >
> > > On 25 October 2017 at 05:44, 俊杰陈 <cjjnj...@gmail.com> wrote:
> > > > Hi
> > > >
> > > > When Impala does a partitioned join on a node, it splits the build
> > > > input into partitions until a partition fits into memory, then
> > > > consumes the probe input, does the join, and outputs rows.
> > > >
> > > > My question is: will Impala schedule multiple tasks to do the join if
> > > > multiple partitions fit into memory, or does it iterate over the
> > > > partitions? And for one partition, does it use multiple threads to do
> > > > the join? Thanks in advance.
> > > >
> > > >
> > > > JJ
> > >
> >
> >
> >
> > --
> > Thanks & Best Regards
> >
>



-- 
Thanks & Best Regards


Re: performance issue on big table join

2017-10-26 Thread
The profile file is damaged. Here is a screenshot of the exec summary

​

2017-10-27 10:04 GMT+08:00 俊杰陈 <cjjnj...@gmail.com>:

> Hi Devs
>
> I hit a performance issue on a big table join. The query takes more than 3
> hours on Impala and only 3 minutes on Spark SQL on the same 5-node
> cluster. While the query is running, the left scanner and exchange node are
> very slow. Did I miss some key arguments?
>
> You can see the profile file in the attachment.
>
>
> ​
> --
> Thanks & Best Regards
>



-- 
Thanks & Best Regards


performance issue on big table join

2017-10-26 Thread
Hi Devs

I hit a performance issue on a big table join. The query takes more than 3
hours on Impala and only 3 minutes on Spark SQL on the same 5-node
cluster. While the query is running, the left scanner and exchange node are
very slow. Did I miss some key arguments?

You can see the profile file in the attachment.


​
-- 
Thanks & Best Regards

Re: How many threads impala start for handling partitioned join?

2017-10-25 Thread
Thanks for the reply.

I saw that IMPALA-3902 <https://issues.apache.org/jira/browse/IMPALA-3902>
seems to add support for multithreaded execution. It describes the goal as
supporting multiple fragment instances on a single node. Does that mean the
coordinator generates multiple instances of a plan fragment on a single
node, so that multiple exchange nodes are started to receive and process
data? Or does it start instances of different plan fragments to prepare
the streaming?

2017-10-25 22:08 GMT+08:00 Jeszy <jes...@gmail.com>:

> Hello JJ,
>
> No, currently Impala uses one thread to execute the join (without
> regard for the number of partitions that fit into memory).
>
> HTH
>
> On 25 October 2017 at 05:44, 俊杰陈 <cjjnj...@gmail.com> wrote:
> > Hi
> >
> > When Impala does a partitioned join on a node, it splits the build input
> > into partitions until a partition fits into memory, then consumes the
> > probe input, does the join, and outputs rows.
> >
> > My question is: will Impala schedule multiple tasks to do the join if
> > multiple partitions fit into memory, or does it iterate over the
> > partitions? And for one partition, does it use multiple threads to do
> > the join? Thanks in advance.
> >
> >
> > JJ
>



-- 
Thanks & Best Regards


How many threads impala start for handling partitioned join?

2017-10-24 Thread
Hi

When Impala does a partitioned join on a node, it splits the build input
into partitions until a partition fits into memory, then consumes the probe
input, does the join, and outputs rows.
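
To make that description concrete, here is a toy, single-threaded sketch of a partitioned hash join. It is an illustration under simplifying assumptions, not Impala's implementation: rows are plain tuples, max_rows stands in for the memory budget, and every name in it (partitioned_hash_join, fanout, max_rows, level) is invented for the example. It handles the partitions one after another.

```python
from collections import defaultdict


def partitioned_hash_join(build, probe, key, fanout=4, max_rows=2, level=0):
    """Inner-join two lists of tuples on column index `key`.

    `max_rows` is a stand-in for "fits into memory": a build input with more
    rows than that is split into `fanout` partitions, each handled in turn.
    """
    if len(build) <= max_rows or level >= 8:
        # Small enough (or out of repartitioning levels, e.g. for heavily
        # duplicated keys): build a hash table and stream the probe rows.
        table = defaultdict(list)
        for row in build:
            table[row[key]].append(row)
        return [b + p for p in probe for b in table.get(p[key], [])]

    # Partition both inputs with the same hash function so that matching
    # keys land in the same partition pair. The level is mixed in so a
    # repartitioned input is split differently from its parent.
    def part(row):
        return hash((level, row[key])) % fanout

    build_parts = defaultdict(list)
    probe_parts = defaultdict(list)
    for row in build:
        build_parts[part(row)].append(row)
    for row in probe:
        probe_parts[part(row)].append(row)

    out = []
    for i in range(fanout):  # partitions are processed sequentially
        out.extend(partitioned_hash_join(
            build_parts[i], probe_parts[i], key, fanout, max_rows, level + 1))
    return out
```

The order of the output rows depends on the hash values, so callers that need a stable order should sort the result.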

My question is: will Impala schedule multiple tasks to do the join if multiple
partitions fit into memory, or does it iterate over the partitions? And for one
partition, does it use multiple threads to do the join? Thanks in advance.


JJ


Re: vim / Eclipse setups for new developers, on the C++ side

2017-09-13 Thread
I use NetBeans to browse the code; its "show call graph" feature is useful to me.

2017-09-14 5:44 GMT+08:00 Tim Armstrong:

> For a long time I've just used GNU screen + VIM with syntax highlighting.
> Then "git grep" or search in VIM as needed to find things. Obviously not
> ideal for everyone.
>
> I've tried YouCompleteMe recently and it works fairly well but hasn't been
> a game-changer for me. Jumping to definitions is handy sometimes but I
> haven't found that it's changed my workflow that much.
>
> On Wed, Sep 13, 2017 at 2:18 PM, Philip Zeyliger wrote:
>
> > Hi folks,
> >
> > I'm querying what folks use for working on the C++ side of the code base.
> > I'm specifically interested in navigation tools for vim (better than
> > ctags), error-highlighting tools for vim (showing syntax errors and such
> > "live"), and Eclipse integration (yes, I've seen the wiki
> >  > Eclipse+Setup+for+Impala+Development>
> > ).
> >
> > I'll be happy to collate and update
> > https://cwiki.apache.org/confluence/display/IMPALA/Useful+Tips+for+New+Impala+Developers
> > (or other appropriate pages) once I get some feedback!
> >
> > Thanks!
> >
> > -- Philip
> >
>



-- 
Thanks & Best Regards


Re: Impala Sorter just sort small partition?

2017-08-04 Thread
Thanks for your detailed description.

My question should have been more specific to the quicksort part. This line
<https://github.com/apache/incubator-impala/blob/master/be/src/runtime/sorter.cc#L1258>
says to recurse on the smaller partition because of stack considerations,
whereas in my understanding quicksort should recurse on both the left and the
right partition. So I'm curious how it keeps one run sorted: is the rest sorted
in the later merge sort, or somewhere else? But the merge process should take
sorted runs as input.
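
For reference, the usual meaning of "recurse on the smaller partition" is that the larger partition still gets sorted, only by the enclosing loop rather than by a second recursive call. A minimal Python sketch of that general technique follows. It is not the code in sorter.cc; the threshold of 16 is the INSERTION_THRESHOLD value mentioned in Tim's reply quoted below, and hybrid_sort, _partition and _insertion_sort are names invented here.

```python
INSERTION_THRESHOLD = 16  # value quoted in this thread; used here for illustration


def _insertion_sort(a, lo, hi):
    """Sort a[lo:hi] in place."""
    for i in range(lo + 1, hi):
        item = a[i]
        j = i - 1
        while j >= lo and a[j] > item:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = item


def _partition(a, lo, hi):
    """Hoare-style partition of a[lo:hi] around the middle element.

    Returns cut with lo < cut < hi such that every element of a[lo:cut]
    is <= every element of a[cut:hi].
    """
    pivot = a[(lo + hi - 1) // 2]
    i, j = lo - 1, hi
    while True:
        i += 1
        while a[i] < pivot:
            i += 1
        j -= 1
        while a[j] > pivot:
            j -= 1
        if i >= j:
            return j + 1
        a[i], a[j] = a[j], a[i]


def hybrid_sort(a, lo=0, hi=None):
    """Sort a[lo:hi] in place with quicksort plus insertion sort."""
    if hi is None:
        hi = len(a)
    while hi - lo > INSERTION_THRESHOLD:
        cut = _partition(a, lo, hi)
        if cut - lo < hi - cut:
            hybrid_sort(a, lo, cut)  # recurse on the smaller left part
            lo = cut                 # the loop continues with the larger right part
        else:
            hybrid_sort(a, cut, hi)  # recurse on the smaller right part
            hi = cut                 # the loop continues with the larger left part
    _insertion_sort(a, lo, hi)
```

Because each recursive call covers at most half of the current range, the recursion depth stays logarithmic in the input size, while the loop takes care of the larger side. In a scheme like this the whole run is sorted before any merge step sees it.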

2017-08-05 0:18 GMT+08:00 Tim Armstrong <tarmstr...@cloudera.com>:

> The Sorter does a 3-level hybrid sort with merge sort, quicksort and
> insertion sort.
>
> SortHelper implements a 2-level hybrid in-memory sort. It fully sorts an
> arbitrarily sized in-memory input. E.g. if 'begin' and 'end' point to the
> begin and end of the sorted run, it will sort the full run. It does
> quicksort recursively then switches to insertion sort once the partitions
> are less than INSERTION_THRESHOLD = 16.
>
> Sorter also supports an external merge sort - if the full input doesn't fit
> in memory, it sorts in-memory runs with SortHelper() then does merge sort
> with the sorted runs.
>
> On Thu, Aug 3, 2017 at 11:13 PM, 俊杰陈 <cjjnj...@gmail.com> wrote:
>
> > Hi
> > I'm looking at Sorter.cc and found that Sorter::SortHelper sorts only the
> > smaller partition. Is there anything I missed?
> >
> > --
> > Thanks & Best Regards
> >
>



-- 
Thanks & Best Regards


Impala Sorter just sort small partition?

2017-08-04 Thread
Hi
I'm looking at Sorter.cc and found that Sorter::SortHelper sorts only the
smaller partition. Is there anything I missed?

-- 
Thanks & Best Regards


Re: material for impala newbie

2017-08-02 Thread
Thanks!

I'm now reading this useful material.

2017-08-03 1:11 GMT+08:00 Tim Armstrong <tarmstr...@cloudera.com>:

> There's also a wiki page with some pointers:
> https://cwiki.apache.org/confluence/display/IMPALA/Codegen
>
> On Wed, Aug 2, 2017 at 10:05 AM, Henry Robinson <he...@apache.org> wrote:
>
> > We don't have a lot of in-depth documentation, partly because the
> > implementation details change frequently.
> >
> > Have you read the Impala paper?
> > http://cidrdb.org/cidr2015/Papers/CIDR15_Paper28.pdf
> > (here's a summary:
> > https://blog.acolyer.org/2015/02/05/impala-a-modern-open-source-sql-engine-for-hadoop/
> > )
> >
> > There's also an old paper on code generation:
> > https://pdfs.semanticscholar.org/bac4/169d6b6f713c76271b5ccf3d45293351f785.pdf
> >
> > But the very best thing to read is the source code...
> >
> > On 2 August 2017 at 09:59, 俊杰陈 <cjjnj...@gmail.com> wrote:
> >
> > > Hi
> > >
> > > I'm learning the Impala code now. Does anyone have any Impala docs/PPT
> > > on the computing workflow (such as ORDER BY), vectorization, and codegen?
> > > Thanks in advance.
> > >
> > > --
> > > Thanks & Best Regards
> > >
> >
>



-- 
Thanks & Best Regards


material for impala newbie

2017-08-02 Thread
Hi

I'm learning the Impala code now. Does anyone have any Impala docs/PPT on
the computing workflow (such as ORDER BY), vectorization, and codegen? Thanks
in advance.

-- 
Thanks & Best Regards