-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43626/#review120265
-----------------------------------------------------------




ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java (line 1499)
<https://reviews.apache.org/r/43626/#comment181697>

    nit: I don't see newFiles modified outside. Can this be made as an return 
argument from Hive.copyFiles()? Instead of InOut argument.



ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java (line 1746)
<https://reviews.apache.org/r/43626/#comment181698>

    nit: same here. newFiles not modified anywhere other than copyFiles.



ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java (line 2592)
<https://reviews.apache.org/r/43626/#comment181695>

    I think the thread pool size should be bounded. Cached thread pool creates 
threads on-demand. On sudden burst of events this could end up creating too 
many threads that increases context switch times.



ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java (line 2593)
<https://reviews.apache.org/r/43626/#comment181699>

    Can you also name the thread pool with setNameFormat()?



ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java (line 2643)
<https://reviews.apache.org/r/43626/#comment181701>

    I think this should happen inside the callable and not in notification 
listener. The cost of creating path is cheaper when compared to moving or 
copying. 
    
    The default  method of addCallback() handles notification in the same 
thread that sent the notification. As per documentation, "Note: If the callback 
is slow or heavyweight, consider supplying an executor. If you do not supply an 
executor, addCallback will use a direct executor, which carries some caveats 
for heavier operations. For example, the callback may run on an unpredictable 
or undesirable thread".



ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java (line 2645)
<https://reviews.apache.org/r/43626/#comment181706>

    I can see 2 expensive operations here
    1) listStatus
    2) mvFile
    
    I think both should be handled by the threadpool.
    Typical sequence of operations would be 
     - For each directory, submit to threadpool to compute listStatus. This 
returns List<Future>
     - Iterate the futures, invoke future.get() to get List<Path> for each 
directories. If get() throws exception then shutdown the executor and log 
exception.
     - Iterator the List<Path>, submit to threadpool to perform mvFile. This 
returns List<Future>
     - Iterate the futures, invoke futures.get() to get void/boolean. Same way 
of exception handling here as well. 
     
     I think addCallback is good approach but it adds complications when 
multi-level futures are involved like above and waiting can be tricking like 
the 1 hour waiting below.


- Prasanth_J


On Feb. 16, 2016, 9:20 p.m., Ashutosh Chauhan wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/43626/
> -----------------------------------------------------------
> 
> (Updated Feb. 16, 2016, 9:20 p.m.)
> 
> 
> Review request for hive and Prasanth_J.
> 
> 
> Bugs: HIVE-12988
>     https://issues.apache.org/jira/browse/HIVE-12988
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> Improve dynamic partition loading IV Parallelize copyFiles()
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java a92c002 
> 
> Diff: https://reviews.apache.org/r/43626/diff/
> 
> 
> Testing
> -------
> 
> Regression suite
> 
> 
> Thanks,
> 
> Ashutosh Chauhan
> 
>

Reply via email to