[jira] [Commented] (HIVE-5298) AvroSerde performance problem caused by HIVE-3833

2014-05-11 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994150#comment-13994150
 ] 

Hive QA commented on HIVE-5298:
---



{color:red}Overall{color}: -1 no tests executed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12603614/HIVE-5298.1.patch

Test results: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-Build/159/testReport
Console output: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-Build/159/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Tests exited with: NonZeroExitCodeException
Command 'bash /data/hive-ptest/working/scratch/source-prep.sh' failed with exit 
status 1 and output '+ [[ -n '' ]]
+ export 'ANT_OPTS=-Xmx1g -XX:MaxPermSize=256m '
+ ANT_OPTS='-Xmx1g -XX:MaxPermSize=256m '
+ export 'M2_OPTS=-Xmx1g -XX:MaxPermSize=256m -Dhttp.proxyHost=localhost 
-Dhttp.proxyPort=3128'
+ M2_OPTS='-Xmx1g -XX:MaxPermSize=256m -Dhttp.proxyHost=localhost 
-Dhttp.proxyPort=3128'
+ cd /data/hive-ptest/working/
+ tee /data/hive-ptest/logs/PreCommit-HIVE-Build-159/source-prep.txt
+ [[ false == \t\r\u\e ]]
+ mkdir -p maven ivy
+ [[ svn = \s\v\n ]]
+ [[ -n '' ]]
+ [[ -d apache-svn-trunk-source ]]
+ [[ ! -d apache-svn-trunk-source/.svn ]]
+ [[ ! -d apache-svn-trunk-source ]]
+ cd apache-svn-trunk-source
+ svn revert -R .
Reverted 
'metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java'
Reverted 'ql/src/test/results/clientpositive/database_drop.q.out'
Reverted 'ql/src/test/queries/clientpositive/database_drop.q'
++ awk '{print $2}'
++ egrep -v '^X|^Performing status on external'
++ svn status --no-ignore
+ rm -rf target datanucleus.log ant/target shims/target shims/0.20/target 
shims/0.20S/target shims/0.23/target shims/aggregator/target 
shims/common/target shims/common-secure/target packaging/target 
hbase-handler/target testutils/target jdbc/target metastore/target 
metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java.orig
 itests/target itests/hcatalog-unit/target itests/test-serde/target 
itests/qtest/target itests/hive-minikdc/target itests/hive-unit/target 
itests/custom-serde/target itests/util/target hcatalog/target 
hcatalog/core/target hcatalog/streaming/target 
hcatalog/server-extensions/target hcatalog/webhcat/svr/target 
hcatalog/webhcat/java-client/target hcatalog/hcatalog-pig-adapter/target 
hwi/target common/target common/src/gen contrib/target service/target 
serde/target beeline/target odbc/target cli/target 
ql/dependency-reduced-pom.xml ql/target 
ql/src/test/results/clientpositive/database_drop.q.out.orig 
ql/src/test/queries/clientpositive/database_drop.q.orig
+ svn update

Fetching external item into 'hcatalog/src/test/e2e/harness'
External at revision 1593663.

At revision 1593663.
+ patchCommandPath=/data/hive-ptest/working/scratch/smart-apply-patch.sh
+ patchFilePath=/data/hive-ptest/working/scratch/build.patch
+ [[ -f /data/hive-ptest/working/scratch/build.patch ]]
+ chmod +x /data/hive-ptest/working/scratch/smart-apply-patch.sh
+ /data/hive-ptest/working/scratch/smart-apply-patch.sh 
/data/hive-ptest/working/scratch/build.patch
The patch does not appear to apply with p0, p1, or p2
+ exit 1
'
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12603614

 AvroSerde performance problem caused by HIVE-3833
 -

 Key: HIVE-5298
 URL: https://issues.apache.org/jira/browse/HIVE-5298
 Project: Hive
  Issue Type: Improvement
  Components: Query Processor
Affects Versions: 0.11.0
Reporter: Xuefu Zhang
Assignee: Xuefu Zhang
 Fix For: 0.14.0

 Attachments: HIVE-5298.1.patch, HIVE-5298.patch


 HIVE-3833 fixed the targeted problem and made Hive to use partition-level 
 metadata to initialize object inspector. In doing that, however, it goes thru 
 every file under the table to access the partition metadata, which is very 
 inefficient, especially in case of multiple files per partition. This causes 
 more problem for AvroSerde because AvroSerde initialization accesses schema, 
 which is located on file system. As a result, before hive can process any 
 data, it needs to access every file for a table, which can take long enough 
 to cause job failure because of lack of job progress.
 The improvement can be made so that partition metadata is only access once 
 per partition.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-5298) AvroSerde performance problem caused by HIVE-3833

2013-09-30 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13782176#comment-13782176
 ] 

Xuefu Zhang commented on HIVE-5298:
---

[~hsubramaniyan] Thanks for your comments. I think what Ashutosh meant above 
was that the path in conf.getPathToAliases() is the partition directory, not 
individual file for the partition. Thus, the number of entries from calling  
conf.getPathToAliases() is the same as what my modified code change will get. I 
verified that Ashutosh's point is correct. Did you see any evidence of 
conf.getPathToAliases() returning each data file as entries?

 AvroSerde performance problem caused by HIVE-3833
 -

 Key: HIVE-5298
 URL: https://issues.apache.org/jira/browse/HIVE-5298
 Project: Hive
  Issue Type: Improvement
  Components: Query Processor
Affects Versions: 0.11.0
Reporter: Xuefu Zhang
Assignee: Xuefu Zhang
 Fix For: 0.13.0

 Attachments: HIVE-5298.1.patch, HIVE-5298.patch


 HIVE-3833 fixed the targeted problem and made Hive to use partition-level 
 metadata to initialize object inspector. In doing that, however, it goes thru 
 every file under the table to access the partition metadata, which is very 
 inefficient, especially in case of multiple files per partition. This causes 
 more problem for AvroSerde because AvroSerde initialization accesses schema, 
 which is located on file system. As a result, before hive can process any 
 data, it needs to access every file for a table, which can take long enough 
 to cause job failure because of lack of job progress.
 The improvement can be made so that partition metadata is only access once 
 per partition.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HIVE-5298) AvroSerde performance problem caused by HIVE-3833

2013-09-28 Thread Hari Sankar Sivarama Subramaniyan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13780749#comment-13780749
 ] 

Hari Sankar Sivarama Subramaniyan commented on HIVE-5298:
-

[~xuefuz] and [~ashutoshc] : I looked at the exact piece of code and thought of 
doing a similar optimization mentioned here while looking at one of my jiras, 
HIVE-5348. It seems like 
1. conf.getPathToAliases() gives the path to aliases mapping  
2. conf.getPathToPartitionInfo() gives the path to partition info mapping

It is clear that (1) and (2) return HashMaps of the same size, say numPaths.

In the change [~xuefuz] added the below line ,

{code:title=MapOperator.java|borderStyle=solid}
...
SetPartitionDesc partDescSet = 
new HashSetPartitionDesc(conf.getPathToPartitionInfo().values());
...

{code} 

The size of partDescSet returns the number of distinct partitions associated 
with the map operator. The size of the above partDescSet, say numParts, can be 
way less than numPaths if a partition comprises of many files. Hence the 
relatively less # of iterations. 
Hence, I would +1 since the idea behind this fix looks correct.

NB: The contents of the for loop in the original code looks kind of hairy and I 
am rewriting the contents of the for loop as part of HIVE-5348.

Thanks,
Hari

 AvroSerde performance problem caused by HIVE-3833
 -

 Key: HIVE-5298
 URL: https://issues.apache.org/jira/browse/HIVE-5298
 Project: Hive
  Issue Type: Improvement
  Components: Query Processor
Affects Versions: 0.11.0
Reporter: Xuefu Zhang
Assignee: Xuefu Zhang
 Fix For: 0.13.0

 Attachments: HIVE-5298.1.patch, HIVE-5298.patch


 HIVE-3833 fixed the targeted problem and made Hive to use partition-level 
 metadata to initialize object inspector. In doing that, however, it goes thru 
 every file under the table to access the partition metadata, which is very 
 inefficient, especially in case of multiple files per partition. This causes 
 more problem for AvroSerde because AvroSerde initialization accesses schema, 
 which is located on file system. As a result, before hive can process any 
 data, it needs to access every file for a table, which can take long enough 
 to cause job failure because of lack of job progress.
 The improvement can be made so that partition metadata is only access once 
 per partition.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-5298) AvroSerde performance problem caused by HIVE-3833

2013-09-27 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13780304#comment-13780304
 ] 

Xuefu Zhang commented on HIVE-5298:
---

[~ashutoshc] Thanks for pointing that out. My initial investigation shows that 
onefile is indeed a table or partition location (directory). On that ground, 
the patch shouldn't have much performance improvement. I'm working with QA to 
find out the reason for the reported performance gain, and will close this JIRA 
unless we are clear on this. Thanks again for your feedback.

 AvroSerde performance problem caused by HIVE-3833
 -

 Key: HIVE-5298
 URL: https://issues.apache.org/jira/browse/HIVE-5298
 Project: Hive
  Issue Type: Improvement
  Components: Query Processor
Affects Versions: 0.11.0
Reporter: Xuefu Zhang
Assignee: Xuefu Zhang
 Fix For: 0.13.0

 Attachments: HIVE-5298.1.patch, HIVE-5298.patch


 HIVE-3833 fixed the targeted problem and made Hive to use partition-level 
 metadata to initialize object inspector. In doing that, however, it goes thru 
 every file under the table to access the partition metadata, which is very 
 inefficient, especially in case of multiple files per partition. This causes 
 more problem for AvroSerde because AvroSerde initialization accesses schema, 
 which is located on file system. As a result, before hive can process any 
 data, it needs to access every file for a table, which can take long enough 
 to cause job failure because of lack of job progress.
 The improvement can be made so that partition metadata is only access once 
 per partition.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-5298) AvroSerde performance problem caused by HIVE-3833

2013-09-26 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13778980#comment-13778980
 ] 

Ashutosh Chauhan commented on HIVE-5298:


Can you provide more detail? I think pathToPartInfo will really be returning 
partition directory (I think variable oneFile is misnamed there). If so, it 
seems like for loop will have same # of iterations before and after patch. I 
don't get from where the perf advantage is coming from.

 AvroSerde performance problem caused by HIVE-3833
 -

 Key: HIVE-5298
 URL: https://issues.apache.org/jira/browse/HIVE-5298
 Project: Hive
  Issue Type: Improvement
  Components: Query Processor
Affects Versions: 0.11.0
Reporter: Xuefu Zhang
Assignee: Xuefu Zhang
 Fix For: 0.13.0

 Attachments: HIVE-5298.1.patch, HIVE-5298.patch


 HIVE-3833 fixed the targeted problem and made Hive to use partition-level 
 metadata to initialize object inspector. In doing that, however, it goes thru 
 every file under the table to access the partition metadata, which is very 
 inefficient, especially in case of multiple files per partition. This causes 
 more problem for AvroSerde because AvroSerde initialization accesses schema, 
 which is located on file system. As a result, before hive can process any 
 data, it needs to access every file for a table, which can take long enough 
 to cause job failure because of lack of job progress.
 The improvement can be made so that partition metadata is only access once 
 per partition.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-5298) AvroSerde performance problem caused by HIVE-3833

2013-09-26 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13779059#comment-13779059
 ] 

Xuefu Zhang commented on HIVE-5298:
---

[~ashutoshc] The issue isn't with pathToPartInfo. The issue is in original code 
{code}
try {
  for (String onefile : conf.getPathToAliases().keySet()) {
PartitionDesc pd = conf.getPathToPartitionInfo().get(onefile);
{code}

where conf.getPathToAliases().keySet() returns all files that belong to the 
table. Thus, the for loop has to go thru each file in the table to get 
partition metadata (using pathToPartInfo()). However, it's only necessary to do 
so per partition, and a partition may have any number of files.

Let me know if more info is needed.


 AvroSerde performance problem caused by HIVE-3833
 -

 Key: HIVE-5298
 URL: https://issues.apache.org/jira/browse/HIVE-5298
 Project: Hive
  Issue Type: Improvement
  Components: Query Processor
Affects Versions: 0.11.0
Reporter: Xuefu Zhang
Assignee: Xuefu Zhang
 Fix For: 0.13.0

 Attachments: HIVE-5298.1.patch, HIVE-5298.patch


 HIVE-3833 fixed the targeted problem and made Hive to use partition-level 
 metadata to initialize object inspector. In doing that, however, it goes thru 
 every file under the table to access the partition metadata, which is very 
 inefficient, especially in case of multiple files per partition. This causes 
 more problem for AvroSerde because AvroSerde initialization accesses schema, 
 which is located on file system. As a result, before hive can process any 
 data, it needs to access every file for a table, which can take long enough 
 to cause job failure because of lack of job progress.
 The improvement can be made so that partition metadata is only access once 
 per partition.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-5298) AvroSerde performance problem caused by HIVE-3833

2013-09-26 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13779088#comment-13779088
 ] 

Ashutosh Chauhan commented on HIVE-5298:


Have you verified onefile is really a file, not a partition dir? I am under the 
impression that onefile is really a partition dir or table dir for 
unpartitioned tables.

 AvroSerde performance problem caused by HIVE-3833
 -

 Key: HIVE-5298
 URL: https://issues.apache.org/jira/browse/HIVE-5298
 Project: Hive
  Issue Type: Improvement
  Components: Query Processor
Affects Versions: 0.11.0
Reporter: Xuefu Zhang
Assignee: Xuefu Zhang
 Fix For: 0.13.0

 Attachments: HIVE-5298.1.patch, HIVE-5298.patch


 HIVE-3833 fixed the targeted problem and made Hive to use partition-level 
 metadata to initialize object inspector. In doing that, however, it goes thru 
 every file under the table to access the partition metadata, which is very 
 inefficient, especially in case of multiple files per partition. This causes 
 more problem for AvroSerde because AvroSerde initialization accesses schema, 
 which is located on file system. As a result, before hive can process any 
 data, it needs to access every file for a table, which can take long enough 
 to cause job failure because of lack of job progress.
 The improvement can be made so that partition metadata is only access once 
 per partition.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-5298) AvroSerde performance problem caused by HIVE-3833

2013-09-18 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13771208#comment-13771208
 ] 

Hive QA commented on HIVE-5298:
---



{color:green}Overall{color}: +1 all checks pass

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12603614/HIVE-5298.1.patch

{color:green}SUCCESS:{color} +1 3126 tests passed

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/806/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/806/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
{noformat}

This message is automatically generated.

 AvroSerde performance problem caused by HIVE-3833
 -

 Key: HIVE-5298
 URL: https://issues.apache.org/jira/browse/HIVE-5298
 Project: Hive
  Issue Type: Improvement
  Components: Query Processor
Affects Versions: 0.11.0
Reporter: Xuefu Zhang
Assignee: Xuefu Zhang
 Fix For: 0.13.0

 Attachments: HIVE-5298.1.patch, HIVE-5298.patch


 HIVE-3833 fixed the targeted problem and made Hive to use partition-level 
 metadata to initialize object inspector. In doing that, however, it goes thru 
 every file under the table to access the partition metadata, which is very 
 inefficient, especially in case of multiple files per partition. This causes 
 more problem for AvroSerde because AvroSerde initialization accesses schema, 
 which is located on file system. As a result, before hive can process any 
 data, it needs to access every file for a table, which can take long enough 
 to cause job failure because of lack of job progress.
 The improvement can be made so that partition metadata is only access once 
 per partition.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-5298) AvroSerde performance problem caused by HIVE-3833

2013-09-16 Thread Vikram Dixit K (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13768934#comment-13768934
 ] 

Vikram Dixit K commented on HIVE-5298:
--

Sorry about the assignment change. Some accidental typing/clicking. I do not 
know what caused it. Assigned back to Xuefu.

 AvroSerde performance problem caused by HIVE-3833
 -

 Key: HIVE-5298
 URL: https://issues.apache.org/jira/browse/HIVE-5298
 Project: Hive
  Issue Type: Improvement
  Components: Query Processor
Affects Versions: 0.11.0
Reporter: Xuefu Zhang
Assignee: Xuefu Zhang
 Fix For: 0.13.0

 Attachments: HIVE-5298.patch


 HIVE-3833 fixed the targeted problem and made Hive to use partition-level 
 metadata to initialize object inspector. In doing that, however, it goes thru 
 every file under the table to access the partition metadata, which is very 
 inefficient, especially in case of multiple files per partition. This causes 
 more problem for AvroSerde because AvroSerde initialization accesses schema, 
 which is located on file system. As a result, before hive can process any 
 data, it needs to access every file for a table, which can take long enough 
 to cause job failure because of lack of job progress.
 The improvement can be made so that partition metadata is only access once 
 per partition.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira