Re: Review Request 71707: Performance degradation on single row inserts

2019-11-07 Thread Attila Magyar


> On Nov. 5, 2019, 11:59 p.m., Ashutosh Chauhan wrote:
> > standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/FileUtils.java
> > Line 331 (original), 324 (patched)
> > 
> >
> > you may use BlobStorageUtils::isBlobStorageFileSystem() here.

isBlobStorageFileSystem() matches the s3, s3a and s3n schemes, but only S3AFileSystem 
(https://github.com/apache/hadoop/blob/1d5d7d0989e9ee2f4527dc47ba5c80e1c38f641a/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L3861)
has an optimized listFiles() implementation.

NativeS3FileSystem 
(https://github.com/apache/hadoop/blob/1d5d7d0989e9ee2f4527dc47ba5c80e1c38f641a/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3native/NativeS3FileSystem.java)
inherits the tree-traversing algorithm from the base class.
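
To illustrate the distinction, here is a minimal sketch of a scheme check that
is narrower than isBlobStorageFileSystem(). The class and method names are
hypothetical and are not taken from the patch; the sketch uses only the JDK
and assumes the decision can be made from the URI scheme alone.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical helper for illustration; not part of Hive or Hadoop. */
final class RecursiveListingSupport {
  private RecursiveListingSupport() {}

  /**
   * Returns true only for the s3a scheme. The s3 and s3n schemes are blob
   * stores as well, but they fall back to the base-class tree walk, so they
   * are deliberately excluded here.
   */
  static boolean hasOptimizedRecursiveListing(URI location) {
    String scheme = location.getScheme();
    return scheme != null && "s3a".equals(scheme.toLowerCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    String[] samples = {
      "s3a://bucket/warehouse/t",
      "s3n://bucket/warehouse/t",
      "hdfs://nn:8020/warehouse/t"
    };
    for (String sample : samples) {
      System.out.println(sample + " -> "
          + hasOptimizedRecursiveListing(URI.create(sample)));
    }
  }
}
```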


- Attila


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71707/#review218518
---


On Nov. 7, 2019, 9:23 a.m., Attila Magyar wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/71707/
> ---
> 
> (Updated Nov. 7, 2019, 9:23 a.m.)
> 
> 
> Review request for hive, Ashutosh Chauhan, Peter Vary, and Slim Bouguerra.
> 
> 
> Bugs: HIVE-22411
> https://issues.apache.org/jira/browse/HIVE-22411
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> Executing single insert statements on a transactional table affects write 
> performance on an S3 file system. Each insert creates a new delta directory. 
> After each insert, Hive calculates statistics such as the number of files in 
> the table and the total size of the table. To calculate these, it traverses 
> the directory recursively, and during the recursion a separate listStatus 
> call is executed for each path. As a result, the more delta directories you 
> have, the more time it takes to calculate the statistics.
> 
> Therefore insertion time goes up linearly with the number of delta 
> directories.
> 
> 
> Diffs
> -
> 
>   common/src/java/org/apache/hadoop/hive/common/FileUtils.java 651b842f688 
>   common/src/java/org/apache/hadoop/hive/common/HiveStatsUtils.java 
> 09343e56166 
>   
> standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/Warehouse.java
>  38e843aeacf 
>   
> standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/FileUtils.java
>  bf206fffc26 
> 
> 
> Diff: https://reviews.apache.org/r/71707/diff/3/
> 
> 
> Testing
> ---
> 
> measured and plotted insertion time
> 
> 
> Thanks,
> 
> Attila Magyar
> 
>



Re: Review Request 71707: Performance degradation on single row inserts

2019-11-07 Thread Attila Magyar

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71707/#review218524
---




standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/FileUtils.java
Line 331 (original), 324 (patched)


BlobStorageUtils::isBlobStorageFileSystem() checks whether the scheme is "s3", 
"s3n" or "s3a". But only S3AFileSystem has the optimized listFiles(). 
NativeS3FileSystem does not override the tree-walking algorithm from the base 
class.

See: 
https://github.com/apache/hadoop/blob/1d5d7d0989e9ee2f4527dc47ba5c80e1c38f641a/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L3861

and:


https://github.com/apache/hadoop/blob/1d5d7d0989e9ee2f4527dc47ba5c80e1c38f641a/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3native/NativeS3FileSystem.java


- Attila Magyar





Re: Review Request 71707: Performance degradation on single row inserts

2019-11-07 Thread Attila Magyar

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71707/
---

(Updated Nov. 7, 2019, 9:23 a.m.)


Review request for hive, Ashutosh Chauhan, Peter Vary, and Slim Bouguerra.


Changes
---

Addressing review comments


Bugs: HIVE-22411
https://issues.apache.org/jira/browse/HIVE-22411


Repository: hive-git


Description
---

Executing single insert statements on a transactional table affects write 
performance on an S3 file system. Each insert creates a new delta directory. 
After each insert, Hive calculates statistics such as the number of files in 
the table and the total size of the table. To calculate these, it traverses 
the directory recursively, and during the recursion a separate listStatus call 
is executed for each path. As a result, the more delta directories you have, 
the more time it takes to calculate the statistics.

Therefore insertion time goes up linearly with the number of delta directories.
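
The cost difference described above can be sketched with a toy model. This is
not the patched Hive code: the class, the key layout and the page size are
assumptions made for illustration. The model counts listing requests for a
table with one file per delta directory, once for a per-directory tree walk
and once for a single flat, paginated listing under the table prefix.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model for illustration; names and layout are hypothetical. */
final class ListingCostSketch {
  private ListingCostSketch() {}

  /** Builds object keys for a table "t" with one file per delta directory. */
  static List<String> tableKeys(int deltas) {
    List<String> keys = new ArrayList<>();
    for (int i = 1; i <= deltas; i++) {
      keys.add(String.format("t/delta_%07d_%07d/bucket_00000", i, i));
    }
    return keys;
  }

  /** Tree walk: one listing request for the root and one per sub directory. */
  static int treeWalkCalls(List<String> keys, String root) {
    Set<String> dirs = new HashSet<>();
    dirs.add(root);
    for (String key : keys) {
      // Every '/' after the root prefix marks a directory that gets listed.
      for (int i = key.indexOf('/', root.length() + 1); i >= 0;
          i = key.indexOf('/', i + 1)) {
        dirs.add(key.substring(0, i));
      }
    }
    return dirs.size();
  }

  /** Flat listing: one request per page of keys (page size is an assumption). */
  static int flatListingCalls(List<String> keys, int pageSize) {
    return Math.max(1, (keys.size() + pageSize - 1) / pageSize);
  }

  public static void main(String[] args) {
    for (int deltas : new int[] {10, 100, 1000}) {
      List<String> keys = tableKeys(deltas);
      System.out.println(deltas + " deltas: tree walk = "
          + treeWalkCalls(keys, "t") + " calls, flat listing = "
          + flatListingCalls(keys, 1000) + " calls");
    }
  }
}
```

In this model the tree walk needs one request per delta directory plus one for
the table root, while the flat listing grows only with the number of pages.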


Diffs (updated)
-

  common/src/java/org/apache/hadoop/hive/common/FileUtils.java 651b842f688 
  common/src/java/org/apache/hadoop/hive/common/HiveStatsUtils.java 09343e56166 
  
standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/Warehouse.java
 38e843aeacf 
  
standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/FileUtils.java
 bf206fffc26 


Diff: https://reviews.apache.org/r/71707/diff/3/

Changes: https://reviews.apache.org/r/71707/diff/2-3/


Testing
---

measured and plotted insertion time


Thanks,

Attila Magyar



Re: Review Request 71707: Performance degradation on single row inserts

2019-11-05 Thread Ashutosh Chauhan

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71707/#review218518
---




standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/FileUtils.java
Line 323 (original), 321 (patched)


Can you please also make a similar change to 
common/src/java/org/apache/hadoop/hive/common/FileUtils.java::listStatusRecursively()
so that method also benefits from this change?



standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/FileUtils.java
Line 331 (original), 324 (patched)


you may use BlobStorageUtils::isBlobStorageFileSystem() here.



standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/FileUtils.java
Lines 378 (patched)


BlobStorageUtils::isBlobStorageFileSystem() instead


- Ashutosh Chauhan


On Nov. 5, 2019, 3:32 p.m., Attila Magyar wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/71707/
> ---
> 
> (Updated Nov. 5, 2019, 3:32 p.m.)
> 
> 
> Review request for hive, Ashutosh Chauhan, Peter Vary, and Slim Bouguerra.
> 
> 
> Bugs: HIVE-22411
> https://issues.apache.org/jira/browse/HIVE-22411
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> Executing single insert statements on a transactional table affects write 
> performance on an S3 file system. Each insert creates a new delta directory. 
> After each insert, Hive calculates statistics such as the number of files in 
> the table and the total size of the table. To calculate these, it traverses 
> the directory recursively, and during the recursion a separate listStatus 
> call is executed for each path. As a result, the more delta directories you 
> have, the more time it takes to calculate the statistics.
> 
> Therefore insertion time goes up linearly with the number of delta 
> directories.
> 
> 
> Diffs
> -
> 
>   
> standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/Warehouse.java
>  38e843aeacf 
>   
> standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/FileUtils.java
>  bf206fffc26 
> 
> 
> Diff: https://reviews.apache.org/r/71707/diff/2/
> 
> 
> Testing
> ---
> 
> measured and plotted insertion time
> 
> 
> Thanks,
> 
> Attila Magyar
> 
>



Re: Review Request 71707: Performance degradation on single row inserts

2019-11-05 Thread Panos Garefalakis via Review Board


> On Nov. 5, 2019, 4:33 p.m., Panos Garefalakis wrote:
> > standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/FileUtils.java
> > Lines 328 (patched)
> > 
> >
> > Hey Attila, the solution looks good. However, as other file systems might 
> > face similar issues in the future using this recursive method (e.g. Azure 
> > Blob storage), wouldn't it make sense to have HDFS as the base case and 
> > handle the others separately? And maybe log a warning here when the 
> > file system is not supported?
> 
> Attila Magyar wrote:
> Hey Panos, I checked the Hadoop project and found only one FS 
> implementation with an optimized recursive listFiles(); the other 
> implementations use the tree-walking implementation from the base class. I 
> think that's the more common case. Do you know where the source of this 
> Azure Blob storage is? Is it open source at all?

Hey Attila, I was referring to this: 
https://hadoop.apache.org/docs/current/hadoop-azure/index.html 
but I was also assuming that the recursive method you modified would be called 
for other file systems as well. If that's not the case, then my comment does 
not apply :)


- Panos


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71707/#review218505
---





Re: Review Request 71707: Performance degradation on single row inserts

2019-11-05 Thread Attila Magyar


> On Nov. 5, 2019, 4:33 p.m., Panos Garefalakis wrote:
> > standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/FileUtils.java
> > Lines 328 (patched)
> > 
> >
> > Hey Attila, the solution looks good. However, as other file systems might 
> > face similar issues in the future using this recursive method (e.g. Azure 
> > Blob storage), wouldn't it make sense to have HDFS as the base case and 
> > handle the others separately? And maybe log a warning here when the 
> > file system is not supported?

Hey Panos, I checked the Hadoop project and found only one FS implementation 
with an optimized recursive listFiles(); the other implementations use the 
tree-walking implementation from the base class. I think that's the more 
common case. Do you know where the source of this Azure Blob storage is? Is it 
open source at all?


- Attila


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71707/#review218505
---





Re: Review Request 71707: Performance degradation on single row inserts

2019-11-05 Thread Panos Garefalakis via Review Board

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71707/#review218505
---




standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/FileUtils.java
Lines 328 (patched)


Hey Attila, the solution looks good. However, as other file systems might 
face similar issues in the future using this recursive method (e.g. Azure Blob 
storage), wouldn't it make sense to have HDFS as the base case and handle the 
others separately? And maybe log a warning here when the file system is not 
supported?
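
One possible shape of this suggestion, as a rough sketch only: the enum, the
class and the log message are hypothetical and do not come from the patch. It
treats hdfs as the base case, handles s3a separately, and logs a warning for
any scheme without a dedicated strategy before falling back to the tree walk.

```java
import java.net.URI;
import java.util.Locale;
import java.util.logging.Logger;

/** Hypothetical strategy selection for illustration; not the patched code. */
final class ListingStrategySketch {
  enum Strategy { TREE_WALK, FLAT_RECURSIVE_LISTING }

  private static final Logger LOG =
      Logger.getLogger(ListingStrategySketch.class.getName());

  private ListingStrategySketch() {}

  static Strategy choose(URI location) {
    String scheme = location.getScheme() == null
        ? "" : location.getScheme().toLowerCase(Locale.ROOT);
    switch (scheme) {
      case "hdfs":
        // Base case: the per-directory tree walk.
        return Strategy.TREE_WALK;
      case "s3a":
        // Handled separately: a single recursive listing.
        return Strategy.FLAT_RECURSIVE_LISTING;
      default:
        // No dedicated strategy for this scheme: warn and use the tree walk.
        LOG.warning("No dedicated listing strategy for scheme '" + scheme
            + "', falling back to the tree walk");
        return Strategy.TREE_WALK;
    }
  }

  public static void main(String[] args) {
    System.out.println(choose(URI.create("hdfs://nn:8020/warehouse/t")));
    System.out.println(choose(URI.create("s3a://bucket/warehouse/t")));
    System.out.println(choose(URI.create("wasb://c@account.blob.core.windows.net/t")));
  }
}
```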


- Panos Garefalakis





Re: Review Request 71707: Performance degradation on single row inserts

2019-11-05 Thread Attila Magyar

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71707/
---

(Updated Nov. 5, 2019, 3:32 p.m.)


Review request for hive, Ashutosh Chauhan, Peter Vary, and Slim Bouguerra.


Changes
---

Addressing Ashutosh's comments


Bugs: HIVE-22411
https://issues.apache.org/jira/browse/HIVE-22411


Repository: hive-git


Description
---

Executing single insert statements on a transactional table affects write 
performance on an S3 file system. Each insert creates a new delta directory. 
After each insert, Hive calculates statistics such as the number of files in 
the table and the total size of the table. To calculate these, it traverses 
the directory recursively, and during the recursion a separate listStatus call 
is executed for each path. As a result, the more delta directories you have, 
the more time it takes to calculate the statistics.

Therefore insertion time goes up linearly with the number of delta directories.


Diffs (updated)
-

  
standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/Warehouse.java
 38e843aeacf 
  
standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/FileUtils.java
 bf206fffc26 


Diff: https://reviews.apache.org/r/71707/diff/2/

Changes: https://reviews.apache.org/r/71707/diff/1-2/


Testing
---

measured and plotted insertion time


Thanks,

Attila Magyar



Re: Review Request 71707: Performance degradation on single row inserts

2019-10-31 Thread Slim Bouguerra

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71707/#review218479
---



Looked at the code, looks good to me.

- Slim Bouguerra


On Oct. 31, 2019, 11:16 a.m., Attila Magyar wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/71707/
> ---
> 
> (Updated Oct. 31, 2019, 11:16 a.m.)
> 
> 
> Review request for hive, Ashutosh Chauhan, Peter Vary, and Slim Bouguerra.
> 
> 
> Bugs: HIVE-22411
> https://issues.apache.org/jira/browse/HIVE-22411
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> Executing single insert statements on a transactional table affects write 
> performance on an S3 file system. Each insert creates a new delta directory. 
> After each insert, Hive calculates statistics such as the number of files in 
> the table and the total size of the table. To calculate these, it traverses 
> the directory recursively, and during the recursion a separate listStatus 
> call is executed for each path. As a result, the more delta directories you 
> have, the more time it takes to calculate the statistics.
> 
> Therefore insertion time goes up linearly with the number of delta 
> directories.
> 
> 
> Diffs
> -
> 
>   
> standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/Warehouse.java
>  38e843aeacf 
>   
> standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/FileUtils.java
>  155ecb18bf5 
> 
> 
> Diff: https://reviews.apache.org/r/71707/diff/1/
> 
> 
> Testing
> ---
> 
> measured and plotted insertion time
> 
> 
> Thanks,
> 
> Attila Magyar
> 
>