> On Nov. 21, 2016, 10:48 p.m., Aihua Xu wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java, line 2789
> > <https://reviews.apache.org/r/53966/diff/1/?file=1568247#file1568247line2789>
> >
> > Do you think that will be a performance impact for HDFS if there are
> > many files in the folder? Looks like recursive is not needed, at least for
> > HDFS since it's to mv to 'destPath = new Path(destf, name + ("_copy_" +
> > counter) + filetype);'?
> >
> > The other changes are more like refactoring, correct?
The recursion is needed because copyFiles() may also copy directories:
for (FileStatus src : srcs) {
  ...
  if (src.isDirectory()) {
    // a source may itself be a directory, so its contents must be listed too
    files = srcFs.listStatus(src.getPath(), FileUtils.HIDDEN_FILES_PATH_FILTER);
  }
  ...
  for (final FileStatus srcFile : files) {
    ...
  }
}
You may see worse performance on the first INSERT INTO statement, but repeated
INSERT INTO statements get much faster. That is not true for the current code.
Here's an example with numbers:
1) Call INSERT INTO only once (creates 100 files)

   BEFORE: 100 rename calls
   AFTER:  1 listFiles + 100 exists + 100 renames

2) Call INSERT INTO 100 times (creates 100 files); performance of the last,
   100th INSERT INTO:

   BEFORE: 1,000 rename calls (each file will call rename() 100 times until
           HDFS returns true)
   AFTER:  1 listFiles + 100 exists + 100 renames
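To make the comparison concrete, here is a rough cost model of the two naming
strategies. All names here are mine, not from the patch, and the
retry-until-free collision pattern is a simplification of what Hive actually
does:

```java
// Hypothetical cost model comparing two strategies for picking a free
// "_copy_N" name in the destination folder.
//
// BEFORE: call rename() blindly with _copy_1, _copy_2, ... until the
//         filesystem accepts one; every collision costs a failed call.
// AFTER:  call listFiles() once, keep the existing names in memory, and
//         answer each exists() check locally before the single rename().
public class CopyCounterCost {

    /** Rename attempts the old strategy needs to add {@code newFiles} files
     *  when {@code existing} copies of the same base name are already there. */
    static int renameCallsBefore(int existing, int newFiles) {
        int calls = 0;
        int present = existing;
        for (int i = 0; i < newFiles; i++) {
            calls += present + 1; // one failed rename per taken suffix, then one success
            present++;
        }
        return calls;
    }

    /** Filesystem calls the new strategy needs: {listFiles, exists, rename}. */
    static int[] callsAfter(int newFiles) {
        return new int[] { 1, newFiles, newFiles };
    }

    public static void main(String[] args) {
        // first INSERT INTO into an empty folder
        System.out.println("before, empty dest: " + renameCallsBefore(0, 1) + " rename");
        // the 100th INSERT INTO of the same file: 99 copies already exist
        System.out.println("before, 99 copies:  " + renameCallsBefore(99, 1) + " renames");
        int[] after = callsAfter(100);
        System.out.println("after: " + after[0] + " listFiles, "
                + after[1] + " exists, " + after[2] + " renames");
    }
}
```

The point of the AFTER column is that the per-file cost no longer grows with
how many copies already sit in the destination, since the exists() checks can
be answered from the listing.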
I could try to save the exists() calls when we are dealing with HDFS.
- Sergio
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53966/#review156517
-----------------------------------------------------------
On Nov. 21, 2016, 10:29 p.m., Sergio Pena wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/53966/
> -----------------------------------------------------------
>
> (Updated Nov. 21, 2016, 10:29 p.m.)
>
>
> Review request for hive.
>
>
> Bugs: HIVE-15199
> https://issues.apache.org/jira/browse/HIVE-15199
>
>
> Repository: hive-git
>
>
> Description
> -------
>
> The patch helps execute repeated INSERT INTO statements on S3 tables when the
> scratch directory is on S3.
>
>
> Diffs
> -----
>
> common/src/java/org/apache/hadoop/hive/common/FileUtils.java
> 1d8c04160c35e48781b20f8e6e14760c19df9ca5
> itests/hive-blobstore/src/test/queries/clientpositive/insert_into.q
> 919ff7d9c7cb40062d68b876d6acbc8efb8a8cf1
> itests/hive-blobstore/src/test/results/clientpositive/insert_into.q.out
> c25d0c4eec6983b6869e2eba711b39ba91a4c6e0
> ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java
> 61b8bd0ac40cffcd6dca0fc874940066bc0aeffe
>
> Diff: https://reviews.apache.org/r/53966/diff/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Sergio Pena
>
>