> On Nov. 21, 2016, 10:48 p.m., Aihua Xu wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java, line 2789
> > <https://reviews.apache.org/r/53966/diff/1/?file=1568247#file1568247line2789>
> >
> > Do you think that will be a performance impact for HDFS if there are
> > many files in the folder? Looks like recursive is not needed, at least for
> > HDFS since it's to mv to 'destPath = new Path(destf, name + ("_copy_" +
> > counter) + filetype);'?
> >
> > The other changes are more like refactoring, correct?
The recursion is needed because copyFiles() may also copy directories:
for (FileStatus src : srcs) {
  ...
  if (src.isDirectory()) {
    // a source may itself be a directory, so its contents must be listed too
    files = srcFs.listStatus(src.getPath(), FileUtils.HIDDEN_FILES_PATH_FILTER);
  }
  ...
  for (final FileStatus srcFile : files) {
    ...
  }
}
You may see worse performance on the first INSERT INTO statement, but repeated
INSERT INTO statements get much faster. That is not true for the current code.
Here's an example with numbers:
1) Call INSERT INTO only once (creates 100 files)

   BEFORE: 100 rename calls
   AFTER:  1 listFiles + 100 exists + 100 renames

2) Call INSERT INTO 100 times (creates 100 files); performance of the last,
   100th INSERT INTO:

   BEFORE: 1,000 rename calls (each file will call rename() 100 times until
           HDFS returns true)
   AFTER:  1 listFiles + 100 exists + 100 renames
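To make the comparison concrete, here is a rough cost model of the two naming
strategies. All names here are mine, not from the patch, and the
retry-until-free collision pattern is a simplification of what Hive actually
does:

```java
// Hypothetical cost model comparing two strategies for picking a free
// "_copy_N" name in the destination folder.
//
// BEFORE: call rename() blindly with _copy_1, _copy_2, ... until the
//         filesystem accepts one; every collision costs a failed call.
// AFTER:  call listFiles() once, keep the existing names in memory, and
//         answer each exists() check locally before the single rename().
public class CopyCounterCost {

    /** Rename attempts the old strategy needs to add {@code newFiles} files
     *  when {@code existing} copies of the same base name are already there. */
    static int renameCallsBefore(int existing, int newFiles) {
        int calls = 0;
        int present = existing;
        for (int i = 0; i < newFiles; i++) {
            calls += present + 1; // one failed rename per taken suffix, then one success
            present++;
        }
        return calls;
    }

    /** Filesystem calls the new strategy needs: {listFiles, exists, rename}. */
    static int[] callsAfter(int newFiles) {
        return new int[] { 1, newFiles, newFiles };
    }

    public static void main(String[] args) {
        // first INSERT INTO into an empty folder
        System.out.println("before, empty dest: " + renameCallsBefore(0, 1) + " rename");
        // the 100th INSERT INTO of the same file: 99 copies already exist
        System.out.println("before, 99 copies:  " + renameCallsBefore(99, 1) + " renames");
        int[] after = callsAfter(100);
        System.out.println("after: " + after[0] + " listFiles, "
                + after[1] + " exists, " + after[2] + " renames");
    }
}
```

The point of the AFTER column is that the per-file cost no longer grows with
how many copies already sit in the destination, since the exists() checks can
be answered from the listing.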
I could try to save the exists() calls when we are dealing with HDFS.
- Sergio
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53966/#review156517
-----------------------------------------------------------
On Nov. 21, 2016, 10:29 p.m., Sergio Pena wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/53966/
> -----------------------------------------------------------
>
> (Updated Nov. 21, 2016, 10:29 p.m.)
>
>
> Review request for hive.
>
>
> Bugs: HIVE-15199
> https://issues.apache.org/jira/browse/HIVE-15199
>
>
> Repository: hive-git
>
>
> Description
> -------
>
> The patch helps execute repeated INSERT INTO statements on S3 tables when the
> scratch directory is on S3.
>
>
> Diffs
> -----
>
> common/src/java/org/apache/hadoop/hive/common/FileUtils.java
> 1d8c04160c35e48781b20f8e6e14760c19df9ca5
> itests/hive-blobstore/src/test/queries/clientpositive/insert_into.q
> 919ff7d9c7cb40062d68b876d6acbc8efb8a8cf1
> itests/hive-blobstore/src/test/results/clientpositive/insert_into.q.out
> c25d0c4eec6983b6869e2eba711b39ba91a4c6e0
> ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java
> 61b8bd0ac40cffcd6dca0fc874940066bc0aeffe
>
> Diff: https://reviews.apache.org/r/53966/diff/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Sergio Pena
>
>