[GitHub] hive pull request #154: Avoiding leaked tcp link to metastore

2017-02-27 Thread gyisgood
GitHub user gyisgood opened a pull request:

https://github.com/apache/hive/pull/154

Avoiding leaked tcp link to metastore

Using Hive.getWithFastCheck in Task.getHive causes a leaked TCP connection to 
the metastore when executing SQL.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gyisgood/hive master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/hive/pull/154.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #154


commit 33fed73174253b1ce0d0897df8c34ab12fcbc801
Author: gaoyang <1033397...@qq.com>
Date:   2017-02-28T07:01:07Z

modify function getHive: use Hive.get instead of Hive.getWithFastCheck,
because the latter causes a leaked TCP connection to the metastore when
executing SQL
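For illustration, the caching pattern this fix relies on can be reduced to a self-contained sketch (the `Client` class below is hypothetical, not Hive's actual metastore client): an accessor returning a per-thread cached instance opens one connection no matter how often it is called, while an accessor that constructs a fresh instance on every call leaks one connection per call if the old instances are never closed.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical client: counts how many "TCP connections" were ever opened.
class Client {
    static final AtomicInteger OPENED = new AtomicInteger();
    // Per-thread cache, analogous to Hive.get() returning a cached instance.
    private static final ThreadLocal<Client> CACHE = ThreadLocal.withInitial(Client::new);

    private Client() { OPENED.incrementAndGet(); } // "opens" a connection

    // Cached accessor: reuses one client per thread, so no leak.
    static Client get() { return CACHE.get(); }

    // Naive accessor: opens a fresh connection on every call; if callers
    // never close the previous instance, each call leaks a connection.
    static Client getFresh() { return new Client(); }
}

public class LeakDemo {
    public static void main(String[] args) {
        for (int i = 0; i < 100; i++) Client.get();      // cached: opens once
        int cached = Client.OPENED.get();
        for (int i = 0; i < 100; i++) Client.getFresh(); // fresh: opens 100 times
        System.out.println("cached opened " + cached
                + ", fresh opened " + (Client.OPENED.get() - cached));
    }
}
```

The same reasoning applies to Hive.get vs. Hive.getWithFastCheck: an accessor that may replace the cached client without closing the old one leaves its connection dangling.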






Re: [VOTE] Drop support for Java7 in master branch

2017-02-27 Thread Rajat Khandelwal
+1

On Tue, Feb 28, 2017 at 10:37 AM, Prasanth Jayachandran <
pjayachand...@hortonworks.com> wrote:

> +1
>
> Thanks
> Prasanth


-- 
Rajat Khandelwal
Tech Lead



[jira] [Created] (HIVE-16055) HiveServer2: Prefetch a batch from the result file so that the RPC fetch request has results available in memory

2017-02-27 Thread Vaibhav Gumashta (JIRA)
Vaibhav Gumashta created HIVE-16055:
---

 Summary: HiveServer2: Prefetch a batch from the result file so 
that the RPC fetch request has results available in memory
 Key: HIVE-16055
 URL: https://issues.apache.org/jira/browse/HIVE-16055
 Project: Hive
  Issue Type: Sub-task
  Components: HiveServer2
Affects Versions: 2.2.0
Reporter: Vaibhav Gumashta


Currently:
Client Fetch -> RPC -> Server -> File Read (via FetchTask#fetch).

If the Fetch RPC can have the next batch of results available to it in memory, 
we can return the batch faster to the client. 
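One way to implement the idea (a hedged sketch; the real FetchTask API differs) is to start reading batch N+1 asynchronously while batch N is returned, so the next fetch RPC finds its rows already in memory:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of a prefetching reader: while the caller consumes batch N,
// batch N+1 is loaded in the background.
public class PrefetchingReader {
    private final ExecutorService pool = Executors.newSingleThreadExecutor();
    private final int totalBatches;
    private int next = 0;
    private CompletableFuture<List<String>> pending;

    PrefetchingReader(int totalBatches) {
        this.totalBatches = totalBatches;
        this.pending = loadAsync(); // warm the first batch before any fetch
    }

    // Simulates a (slow) file read producing one batch of rows.
    private CompletableFuture<List<String>> loadAsync() {
        final int batch = next++;
        if (batch >= totalBatches) return CompletableFuture.completedFuture(null);
        return CompletableFuture.supplyAsync(() -> List.of("row-" + batch), pool);
    }

    // Fetch handler: return the already-loaded batch, then start the next read.
    List<String> fetch() {
        List<String> result = pending.join();
        pending = loadAsync();
        return result;
    }

    void close() { pool.shutdown(); }

    public static void main(String[] args) {
        PrefetchingReader r = new PrefetchingReader(3);
        List<String> all = new ArrayList<>();
        for (List<String> b = r.fetch(); b != null; b = r.fetch()) all.addAll(b);
        r.close();
        System.out.println(all); // [row-0, row-1, row-2]
    }
}
```

The design choice is that the file read for the next batch overlaps with the RPC round trip for the current one, hiding the read latency from the client.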



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (HIVE-16054) AMReporter should use application token instead of ugi.getCurrentUser

2017-02-27 Thread Prasanth Jayachandran (JIRA)
Prasanth Jayachandran created HIVE-16054:


 Summary: AMReporter should use application token instead of 
ugi.getCurrentUser
 Key: HIVE-16054
 URL: https://issues.apache.org/jira/browse/HIVE-16054
 Project: Hive
  Issue Type: Bug
  Components: llap
Affects Versions: 2.2.0
Reporter: Rajesh Balamohan
Assignee: Prasanth Jayachandran


During the initial creation of the UGI we use the appId, but later we use the 
user who submitted the request. This doesn't matter as long as the job tokens 
are set correctly, but it is good to keep it consistent. 





Re: [VOTE] Drop support for Java7 in master branch

2017-02-27 Thread Prasanth Jayachandran
+1

Thanks
Prasanth







[VOTE] Drop support for Java7 in master branch

2017-02-27 Thread Thejas Nair
There was a [DISCUSS] thread on the topic of moving to JDK 8 for unit tests
[1], and many people also expressed the opinion that we should drop JDK 7
support in Hive. Public updates by Oracle were stopped in April 2015 [2].

This vote thread proposes dropping JDK 7 support in the next Apache Hive
2.x release (i.e. the master branch), so that we can start leveraging new
features in Java 8, as well as libraries that require Java 8.

[1] https://s.apache.org/hive-jdk8-test
[2] http://www.oracle.com/technetwork/java/eol-135779.html

Here is my +1.

Vote ends in 72 hours.
Thanks,
Thejas

PS: I think this would fall under "Code change" under Hive-bylaws, so it
doesn't seem to really require a formal vote thread. But I think this does
merit wider attention than a jira ticket.


Re: Review Request 56687: Intern strings in various critical places to reduce memory consumption.

2017-02-27 Thread Rui Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56687/#review166991
---




ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java (line 3178)


do we still need this? I think createEmptyFile will intern the strings for 
us?



ql/src/java/org/apache/hadoop/hive/ql/plan/MapWork.java (line 173)


instead of creating a new map, can we use the pathToAliases map and intern 
the paths in-place?


- Rui Li


On Feb. 27, 2017, 7:42 p.m., Misha Dmitriev wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56687/
> ---
> 
> (Updated Feb. 27, 2017, 7:42 p.m.)
> 
> 
> Review request for hive, Chaoyu Tang, Mohit Sabharwal, and Sergio Pena.
> 
> 
> Bugs: https://issues.apache.org/jira/browse/HIVE-15882
> 
> https://issues.apache.org/jira/browse/HIVE-15882
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> See the description of the problem in 
> https://issues.apache.org/jira/browse/HIVE-15882 Interning strings per this 
> review removes most of the overhead due to duplicate strings.
> 
> Also, where maps in several places are created from other maps, use the 
> original map's size for the new map. This is to avoid the situation when a 
> map with default capacity (typically 16) is created to hold just 2-3 entries, 
> and the rest of the internal 16-entry array is wasted.
> 
> 
> Diffs
> -
> 
>   common/src/java/org/apache/hadoop/hive/common/StringInternUtils.java 
> PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 
> e81cbce3e333d44a4088c10491f399e92a505293 
>   ql/src/java/org/apache/hadoop/hive/ql/hooks/Entity.java 
> 08420664d59f28f75872c25c9f8ee42577b23451 
>   ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 
> e91064b9c75e8adb2b36f21ff19ec0c1539b03b9 
>   ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java 
> 51530ac16c92cc75d501bfcb573557754ba0c964 
>   ql/src/java/org/apache/hadoop/hive/ql/io/SymbolicInputFormat.java 
> 55b3b551a1dac92583b6e03b10beb8172ca93d45 
>   ql/src/java/org/apache/hadoop/hive/ql/lockmgr/HiveLockObject.java 
> 82dc89803be9cf9e0018720eeceb90ff450bfdc8 
>   ql/src/java/org/apache/hadoop/hive/ql/metadata/Partition.java 
> c0edde9e92314d86482b5c46178987e79fae57fe 
>   ql/src/java/org/apache/hadoop/hive/ql/metadata/Table.java 
> c6ae6f290857cfd10f1023058ede99bf4a10f057 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 
> 24d16812515bdfa90b4be7a295c0388fcdfe95ef 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/GenMRSkewJoinProcessor.java
>  ede4fcbe342052ad86dadebcc49da2c0f515ea98 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/NullScanTaskDispatcher.java
>  0882ae2c6205b1636cbc92e76ef66bb70faadc76 
>   
> ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverMergeFiles.java 
> 68b0ad9ea63f051f16fec3652d8525f7ab07eb3f 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/MapWork.java 
> d4bdd96eaf8d179bed43b8a8c3be0d338940154a 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/MsckDesc.java 
> b7a7e4b7a5f8941b080c7805d224d3885885f444 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/PartitionDesc.java 
> 73981e826870139a42ad881103fdb0a2ef8433a2 
> 
> Diff: https://reviews.apache.org/r/56687/diff/
> 
> 
> Testing
> ---
> 
> I've measured how much memory this change plus another one (interning 
> Properties in PartitionDesc) save in my HS2 benchmark - the result is 37%. 
> See the details in HIVE-15882.
> 
> 
> Thanks,
> 
> Misha Dmitriev
> 
>
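The two techniques from the description above can be shown in isolation with plain JDK code (this is illustrative, not the Hive patch itself): String.intern() deduplicates equal strings into one canonical object, and sizing a new HashMap from the source map avoids wasting the default 16-slot array on a copy that holds only 2-3 entries.

```java
import java.util.HashMap;
import java.util.Map;

public class InternDemo {
    // Copy a map, pre-sizing the destination from the source and interning
    // the keys, in the spirit of the patch's path/alias map handling.
    static Map<String, Integer> internedCopy(Map<String, Integer> src) {
        // Capacity derived from src.size() so tiny maps don't over-allocate.
        Map<String, Integer> dst = new HashMap<>(Math.max(2, src.size() * 4 / 3 + 1));
        for (Map.Entry<String, Integer> e : src.entrySet()) {
            dst.put(e.getKey().intern(), e.getValue());
        }
        return dst;
    }

    public static void main(String[] args) {
        String a = new String("hdfs://nn/warehouse/t1"); // a distinct heap object
        String b = "hdfs://nn/warehouse/t1";             // the interned literal
        System.out.println(a == b);                   // false: two objects
        System.out.println(a.intern() == b.intern()); // true: one canonical copy

        Map<String, Integer> src = new HashMap<>();
        src.put(a, 1);
        System.out.println(internedCopy(src).get(b)); // 1
    }
}
```

With many duplicate path strings held in plans, collapsing them to one interned copy is where the memory savings reported in HIVE-15882 come from.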



[jira] [Created] (HIVE-16053) Remove newRatio from llap JAVA_OPTS_BASE

2017-02-27 Thread Siddharth Seth (JIRA)
Siddharth Seth created HIVE-16053:
-

 Summary: Remove newRatio from llap JAVA_OPTS_BASE
 Key: HIVE-16053
 URL: https://issues.apache.org/jira/browse/HIVE-16053
 Project: Hive
  Issue Type: Bug
  Components: llap
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Attachments: HIVE-16053.01.patch

G1GC is supposed to be able to resize regions as required. Setting NewRatio or 
other parameters that size the new generation disables this capability.
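For illustration, the change amounts to dropping the explicit new-generation sizing from the daemon's JVM options and letting G1 adapt. The flag names below are standard HotSpot options, but the exact variable name is an assumption, not LLAP's actual script:

```shell
# Before (hypothetical): a fixed NewRatio pins the young-generation size
# and defeats G1's adaptive region sizing.
# LLAP_DAEMON_OPTS="-XX:+UseG1GC -XX:NewRatio=8"

# After: let G1 size the young generation itself; only a pause goal is hinted.
LLAP_DAEMON_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=200"
```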





[jira] [Created] (HIVE-16052) MM tables: add exchange partition test after ACID integration

2017-02-27 Thread Sergey Shelukhin (JIRA)
Sergey Shelukhin created HIVE-16052:
---

 Summary: MM tables: add exchange partition test after ACID 
integration
 Key: HIVE-16052
 URL: https://issues.apache.org/jira/browse/HIVE-16052
 Project: Hive
  Issue Type: Sub-task
Reporter: Sergey Shelukhin


The exchgpartition2lel test fails if all tables are changed to MM, because of a 
write ID mismatch between directories and tables when exchanging partition 
directories between tables. ACID should probably fix this, because transaction 
IDs are global.
We should add a test after integrating with ACID; if it doesn't work for some 
other reason, we can either implement it as moving to a new mm_id/txn_id in 
each affected partition, or block it on MM tables.

cc [~wzheng]





[jira] [Created] (HIVE-16051) MM tables: skewjoin test fails

2017-02-27 Thread Sergey Shelukhin (JIRA)
Sergey Shelukhin created HIVE-16051:
---

 Summary: MM tables: skewjoin test fails
 Key: HIVE-16051
 URL: https://issues.apache.org/jira/browse/HIVE-16051
 Project: Hive
  Issue Type: Sub-task
Reporter: Sergey Shelukhin


{noformat}
set hive.optimize.skewjoin = true;
set hive.skewjoin.key = 2;
set hive.optimize.metadataonly=false;

CREATE TABLE dest_j1(key INT, value STRING) STORED AS TEXTFILE tblproperties 
("transactional"="true", "transactional_properties"="insert_only");

FROM src src1 JOIN src src2 ON (src1.key = src2.key)
INSERT OVERWRITE TABLE dest_j1 SELECT src1.key, src2.value;

select count(distinct key) from dest_j1;
{noformat}
Different results for MM and non-MM table.

Probably has something to do with how skewjoin handles files. However, looking 
at the MM/debugging logs, there are no suspicious deletes, and everything looks 
the same in both cases; all the logging for skewjoin row containers is 
identical between the two runs (except for the numbers/GUIDs; the number of 
files, paths, etc. are all the same). So it is not clear what is going on. A 
dfs dump could probably answer this question, but it doesn't currently work for 
me on q files.





Re: Review Request 56334: HIVE-12767: Implement table property to address Parquet int96 timestamp bug

2017-02-27 Thread Sergio Pena


> On Feb. 27, 2017, 11:20 p.m., Sergio Pena wrote:
> > Ship It!

Thanks Zsombor. This is good work.


- Sergio


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56334/#review166963
---


On Feb. 24, 2017, 2:56 p.m., Barna Zsombor Klara wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56334/
> ---
> 
> (Updated Feb. 24, 2017, 2:56 p.m.)
> 
> 
> Review request for hive, Ryan Blue and Sergio Pena.
> 
> 
> Bugs: HIVE-12767
> https://issues.apache.org/jira/browse/HIVE-12767
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> This is a followup on this review request: https://reviews.apache.org/r/41821
> The following exit criteria are addressed in this patch:
> 
> - Hive will read Parquet MR int96 timestamp data and adjust values using a 
> time zone from a table property, if set, or using the local time zone if it 
> is absent. No adjustment will be applied to data written by Impala.
> - Hive will write Parquet int96 timestamps using a time zone adjustment from 
> the same table property, if set, or using the local time zone if it is 
> absent. This keeps the data in the table consistent.
> - New tables created by Hive will set the table property to UTC if the global 
> option to set the property for new tables is enabled.
> - Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not set 
> the property unless the global setting to do so is enabled.
> - Tables created using CREATE TABLE LIKE <existing table> will copy the property 
> of the table that is copied.
> 
> To set the timezone table property, use this:
>   create table tbl1 (ts timestamp) stored as parquet tblproperties 
> ('parquet.mr.int96.write.zone'='PST');
> 
> To set UTC as default timezone table property on new tables created, use 
> this: 
>   set parquet.mr.int96.enable.utc.write.zone=true;
>   create table tbl2 (ts timestamp) stored as parquet;
> 
> 
> Diffs
> -
> 
>   common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 
> f0c129bcbe25f07f30eba14234e30a6442649437 
>   data/files/impala_int96_timestamp.parq PRE-CREATION 
>   
> itests/hive-jmh/src/main/java/org/apache/hive/benchmark/storage/ColumnarStorageBench.java
>  a14b7900afb00a7d304b0dc4f6482a2b87716919 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java 
> adabe70fa8f0fe1b990c6ac578a14ff5af06fc93 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetOutputFormat.java
>  379a9135d9c631b2f473976b00f3dc87f9fec0c4 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java 
> 167f9b6516ac093fa30091daf6965de25e3eccb3 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/ETypeConverter.java 
> 76d93b8e02a98c95da8a534f2820cd3e77b4bb43 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java
>  604cbbcc2a9daa8594397e315cc4fd8064cc5005 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/ParquetRecordReaderWrapper.java
>  ac430a67682d3dcbddee89ce132fc0c1b421e368 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetTableUtils.java 
> PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/timestamp/NanoTimeUtils.java 
> 3fd75d24f3fda36967e4957e650aec19050b22f8 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedParquetRecordReader.java
>  b6a1a7a64db6db0bf06d2eea70a308b88f06156e 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedPrimitiveColumnReader.java
>  3d5c6e6a092dd6a0303fadc6a244dad2e31cd853 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriteSupport.java
>  f4621e5dbb81e8d58c4572c901ec9d1a7ca8c012 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java
>  6b7b50a25e553629f0f492e964cc4913417cb500 
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java 
> 934ae9f255d0c4ccaa422054fcc9e725873810d4 
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestVectorizedColumnReader.java
>  670bfa609704d3001dd171b703b657f57fbd4c74 
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/VectorizedColumnReaderTestBase.java
>  f537ceee505c5f41d513df3c89b63453012c9979 
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/convert/TestETypeConverter.java
>  PRE-CREATION 
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/serde/TestParquetTimestampUtils.java
>  ec6def5b9ac5f12e6a7cb24c4f4998a6ca6b4a8e 
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/timestamp/TestNanoTimeUtils.java
>  PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_int96_timestamp.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_timestamp_conversion.q 
> PRE-CREATION 
>   ql/src/te

Re: Review Request 56334: HIVE-12767: Implement table property to address Parquet int96 timestamp bug

2017-02-27 Thread Sergio Pena

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56334/#review166963
---


Ship it!

- Sergio Pena



Re: Review Request 52978: HIVE-14459 TestBeeLineDriver - migration and re-enable

2017-02-27 Thread Vihang Karajgaonkar

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52978/#review166945
---



LGTM (non-binding). Just one comment below. Thanks Peter!


itests/util/src/main/java/org/apache/hadoop/hive/cli/control/CoreBeeLineDriver.java
 (line 76)


we should probably use cleanupLocalDirOnStartup(true) here. As a part of 
fixing flaky tests work, this flag was set to true for the other tests


- Vihang Karajgaonkar


On Feb. 24, 2017, 12:21 p.m., Peter Vary wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52978/
> ---
> 
> (Updated Feb. 24, 2017, 12:21 p.m.)
> 
> 
> Review request for hive, Zoltan Haindrich, Marta Kuczora, Miklos Csanady, 
> Prasanth_J, Sergey Shelukhin, Sergio Pena, Siddharth Seth, and Barna Zsombor 
> Klara.
> 
> 
> Bugs: HIVE-14459
> https://issues.apache.org/jira/browse/HIVE-14459
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> Kept the changes minimal with the sole goal to be able to run the BeeLine 
> query tests multiple times successfully.
> - Enabled the driver
> - Modified the regexps to hide when comparing the results
> - Configured to run only 1 qtest file - so we can test, and could decide 
> later of the beeline testing scope
> - Added required dependencies to pom
> - Added specific results dir for beeline q.out-s
> 
> 
> Diffs
> -
> 
>   beeline/src/java/org/apache/hive/beeline/util/QFileClient.java 81f1b0e 
>   
> itests/hive-unit/src/main/java/org/apache/hive/jdbc/miniHS2/AbstractHiveService.java
>   
>   itests/hive-unit/src/main/java/org/apache/hive/jdbc/miniHS2/MiniHS2.java  
>   
> itests/qtest/src/test/java/org/apache/hadoop/hive/cli/DisabledTestBeeLineDriver.java
>  cb276e6 
>   itests/src/test/resources/testconfiguration.properties d344464 
>   itests/util/pom.xml 6d93dc1 
>   
> itests/util/src/main/java/org/apache/hadoop/hive/cli/control/CliConfigs.java 
> af8ec67 
>   
> itests/util/src/main/java/org/apache/hadoop/hive/cli/control/CoreBeeLineDriver.java
>  e5144e3 
>   ql/src/test/results/clientpositive/beeline/escape_comments.q.out 
> PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/52978/diff/
> 
> 
> Testing
> ---
> 
> Manually on my computer several runs.
> Waiting for the QA tests
> 
> 
> Thanks,
> 
> Peter Vary
> 
>



Re: Review Request 56995: HIVE-15879 : Fix HiveMetaStoreChecker.checkPartitionDirs method

2017-02-27 Thread Vihang Karajgaonkar

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56995/
---

(Updated Feb. 27, 2017, 8:41 p.m.)


Review request for hive, Aihua Xu, Ashutosh Chauhan, and Sergio Pena.


Changes
---

Addressed Sahil's comment


Bugs: HIVE-15879
https://issues.apache.org/jira/browse/HIVE-15879


Repository: hive-git


Description
---

HIVE-15879 : Fix HiveMetaStoreChecker.checkPartitionDirs method


Diffs (updated)
-

  ql/src/java/org/apache/hadoop/hive/ql/metadata/HiveMetaStoreChecker.java 
7c94c95f00492467ba27dedc9ce513e13c85ea61 
  ql/src/test/org/apache/hadoop/hive/ql/metadata/TestHiveMetaStoreChecker.java 
35f52cd522e0e48a333e30966245bec65cc2ec9c 

Diff: https://reviews.apache.org/r/56995/diff/


Testing
---

Tested using existing and newly added test cases


Thanks,

Vihang Karajgaonkar



Re: Review Request 56995: HIVE-15879 : Fix HiveMetaStoreChecker.checkPartitionDirs method

2017-02-27 Thread Vihang Karajgaonkar


> On Feb. 27, 2017, 6:01 p.m., Sahil Takiar wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/metadata/HiveMetaStoreChecker.java, 
> > line 452
> > 
> >
> > Can't the variable declaration be `Queue`? Same for the other 
> > declarations below.
> 
> Vihang Karajgaonkar wrote:
> We can declare it as a Queue, but the implementation needs to be thread-safe 
> since multiple threads are going to operate on the queue at the same time. I 
> thought declaring it as a concurrent queue would make it clearer and more 
> understandable, without any performance implications. Is there a particular 
> advantage to declaring it as a Queue? Declaring the concurrent type also 
> gives us a compile-time check that calling methods are using concurrent 
> queues.
> 
> Sahil Takiar wrote:
> In case we want to change the type of queue being used, e.g. a 
> `BlockingQueue` could be used, or a custom concurrent queue.

Changed the type to Queue


- Vihang


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56995/#review166898
---
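The convention settled on in the thread above is the standard one: declare against the `Queue` interface and instantiate a thread-safe implementation. A minimal sketch (not the HiveMetaStoreChecker code) of why it helps, with call sites staying agnostic to whether a `ConcurrentLinkedQueue`, a `BlockingQueue`, or a custom queue is used:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class QueueDecl {
    // Callers see only Queue; the concrete (thread-safe) class is an
    // implementation detail that can change without touching this method.
    static int drain(Queue<String> q) {
        int n = 0;
        while (q.poll() != null) n++;
        return n;
    }

    public static void main(String[] args) {
        Queue<String> a = new ConcurrentLinkedQueue<>();
        a.add("x"); a.add("y");
        Queue<String> b = new LinkedBlockingQueue<>();
        b.add("z");
        System.out.println(drain(a) + drain(b)); // 3
    }
}
```

The trade-off Vihang raised still holds: the interface type no longer proves at compile time that the instance is concurrent, so that invariant moves to the point of construction.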





Review Request 57102: HIVE-16040 union column expansion should take aliases from the leftmost branch

2017-02-27 Thread Sergey Shelukhin

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/57102/
---

Review request for hive, Ashutosh Chauhan and pengcheng xiong.


Repository: hive-git


Description
---

see jira


Diffs
-

  ql/src/java/org/apache/hadoop/hive/ql/parse/ParseUtils.java 473a664 
  ql/src/test/queries/clientpositive/union_pos_alias.q c4eca68 
  ql/src/test/results/clientpositive/union_pos_alias.q.out 8eddbd9 

Diff: https://reviews.apache.org/r/57102/diff/


Testing
---


Thanks,

Sergey Shelukhin



Re: Review Request 56995: HIVE-15879 : Fix HiveMetaStoreChecker.checkPartitionDirs method

2017-02-27 Thread Sahil Takiar

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56995/#review166935
---


Ship it!




LGTM besides one pending comment.

- Sahil Takiar





Re: Review Request 56995: HIVE-15879 : Fix HiveMetaStoreChecker.checkPartitionDirs method

2017-02-27 Thread Sahil Takiar


> On Feb. 27, 2017, 6:01 p.m., Sahil Takiar wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/metadata/HiveMetaStoreChecker.java, 
> > lines 550-556
> > 
> >
> > Not sure I follow this logic, why are two queues necessary?
> 
> Vihang Karajgaonkar wrote:
> Consider you have just one queue called nextLevel. In that scenario the 
> worker threads are removing elements from and adding to the same queue: the 
> classic producer/consumer model operating on a single queue. Initially I 
> tried using a single blocking queue, but the terminating condition is 
> non-trivial to implement because the worker threads may or may not produce 
> more items for the queue. To avoid race conditions at the 
> while(!nextLevel.isEmpty()) check, we would need either some kind of marker 
> element to indicate that a worker thread is done, or wait()/notify() 
> constructs to signal when the terminating condition has been reached. That 
> implementation was getting too complex if we wanted to avoid all possible 
> race conditions. Given that this code path will be commonly used for the 
> msck repair command, I preferred the safer yet performant way at the small 
> cost of allocating a new queue for each level of the directory tree. Hope 
> that explains.
> 
> Sahil Takiar wrote:
> I see, yes that makes sense. While handling the terminating condition may 
> be more difficult, won't it improve performance? Right now the multi-threaded 
> listing only occurs on a single level of the directory tree at a time, and 
> the code cannot move to the next level until the current level has been fully 
> listed. Which means a few slow listStatus calls could slow down performance. 
> How difficult is it to handle the terminating condition? How about using a 
> semaphore to handle the wait()/notify() logic? It could be used to track the 
> number of outstanding directory listings that need to be done, and every time 
> a file or the max depth is hit the semaphore can be decremented.

Discussed offline, while this may be faster, there are more corner cases to 
handle. An attempt was done to use a single queue approach, but some race 
conditions popped up. For this patch, I think its safe to stick to this 
approach. This issue can be re-visited in other patches.
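The two-queue, level-by-level scheme settled on above can be sketched in plain Java. Everything here (`Node`, the `children` list, the leaf check) is an illustrative stand-in for the `FileStatus`/`listStatus` machinery in the real checker, not Hive's actual code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Sketch of the two-queue breadth-first traversal discussed in this review.
public class TwoQueueTraversal {

    public static class Node {
        final List<Node> children;
        public Node(List<Node> children) { this.children = children; }
    }

    // Counts leaf nodes; returns only after the whole tree has been visited.
    public static int traverse(Node root, ExecutorService pool) {
        ConcurrentLinkedQueue<Node> currentLevel = new ConcurrentLinkedQueue<>();
        currentLevel.add(root);
        int leaves = 0;
        try {
            while (!currentLevel.isEmpty()) {
                // A fresh queue per level: workers only ever *add* to nextLevel,
                // so the isEmpty() check on currentLevel cannot race with them.
                final ConcurrentLinkedQueue<Node> nextLevel = new ConcurrentLinkedQueue<>();
                List<Future<Integer>> futures = new ArrayList<>();
                Node n;
                while ((n = currentLevel.poll()) != null) {
                    final Node node = n;
                    futures.add(pool.submit(() -> {
                        if (node.children.isEmpty()) {
                            return 1;                     // a leaf directory
                        }
                        nextLevel.addAll(node.children);  // work for the next level
                        return 0;
                    }));
                }
                for (Future<Integer> f : futures) {
                    leaves += f.get();                    // barrier: finish this level
                }
                currentLevel = nextLevel;
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return leaves;
    }
}
```

The per-level barrier (`f.get()` on every future) is exactly the trade-off Sahil points out: simple and race-free, at the cost of waiting for the slowest listing in each level.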


- Sahil


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56995/#review166898
---


On Feb. 27, 2017, 6:44 p.m., Vihang Karajgaonkar wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56995/
> ---
> 
> (Updated Feb. 27, 2017, 6:44 p.m.)
> 
> 
> Review request for hive, Aihua Xu, Ashutosh Chauhan, and Sergio Pena.
> 
> 
> Bugs: HIVE-15879
> https://issues.apache.org/jira/browse/HIVE-15879
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> HIVE-15879 : Fix HiveMetaStoreChecker.checkPartitionDirs method
> 
> 
> Diffs
> -
> 
>   ql/src/java/org/apache/hadoop/hive/ql/metadata/HiveMetaStoreChecker.java 
> 7c94c95f00492467ba27dedc9ce513e13c85ea61 
>   
> ql/src/test/org/apache/hadoop/hive/ql/metadata/TestHiveMetaStoreChecker.java 
> 35f52cd522e0e48a333e30966245bec65cc2ec9c 
> 
> Diff: https://reviews.apache.org/r/56995/diff/
> 
> 
> Testing
> ---
> 
> Tested using existing and newly added test cases
> 
> 
> Thanks,
> 
> Vihang Karajgaonkar
> 
>



Re: Review Request 56687: Intern strings in various critical places to reduce memory consumption.

2017-02-27 Thread Misha Dmitriev

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56687/
---

(Updated Feb. 27, 2017, 7:42 p.m.)


Review request for hive, Chaoyu Tang, Mohit Sabharwal, and Sergio Pena.


Changes
---

Addressed the latest comments by Rui.


Bugs: HIVE-15882

https://issues.apache.org/jira/browse/HIVE-15882


Repository: hive-git


Description
---

See the description of the problem in 
https://issues.apache.org/jira/browse/HIVE-15882 Interning strings per this 
review removes most of the overhead due to duplicate strings.

Also, where maps in several places are created from other maps, use the 
original map's size for the new map. This is to avoid the situation when a map 
with default capacity (typically 16) is created to hold just 2-3 entries, and 
the rest of the internal 16-entry array is wasted.
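The interning being described is plain JDK behavior: equal strings built at runtime are distinct objects until interned. The `StringInternUtils` class in the diff presumably centralizes calls of this kind; the demo below is generic Java, not Hive code:

```java
// Shows the duplicate-string overhead this patch targets: two runtime-built
// strings with equal contents are separate objects (separate char arrays),
// and intern() collapses them to one canonical instance.
public class InternDemo {
    public static boolean sameInstanceAfterIntern(String a, String b) {
        return a.intern() == b.intern();   // identity comparison, not equals()
    }
}
```

In a metastore-heavy process like HS2, paths and partition property values repeat heavily, which is why deduplicating them yields the large savings reported in HIVE-15882.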


Diffs (updated)
-

  common/src/java/org/apache/hadoop/hive/common/StringInternUtils.java 
PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 
e81cbce3e333d44a4088c10491f399e92a505293 
  ql/src/java/org/apache/hadoop/hive/ql/hooks/Entity.java 
08420664d59f28f75872c25c9f8ee42577b23451 
  ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 
e91064b9c75e8adb2b36f21ff19ec0c1539b03b9 
  ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java 
51530ac16c92cc75d501bfcb573557754ba0c964 
  ql/src/java/org/apache/hadoop/hive/ql/io/SymbolicInputFormat.java 
55b3b551a1dac92583b6e03b10beb8172ca93d45 
  ql/src/java/org/apache/hadoop/hive/ql/lockmgr/HiveLockObject.java 
82dc89803be9cf9e0018720eeceb90ff450bfdc8 
  ql/src/java/org/apache/hadoop/hive/ql/metadata/Partition.java 
c0edde9e92314d86482b5c46178987e79fae57fe 
  ql/src/java/org/apache/hadoop/hive/ql/metadata/Table.java 
c6ae6f290857cfd10f1023058ede99bf4a10f057 
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 
24d16812515bdfa90b4be7a295c0388fcdfe95ef 
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/GenMRSkewJoinProcessor.java
 ede4fcbe342052ad86dadebcc49da2c0f515ea98 
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/NullScanTaskDispatcher.java
 0882ae2c6205b1636cbc92e76ef66bb70faadc76 
  ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverMergeFiles.java 
68b0ad9ea63f051f16fec3652d8525f7ab07eb3f 
  ql/src/java/org/apache/hadoop/hive/ql/plan/MapWork.java 
d4bdd96eaf8d179bed43b8a8c3be0d338940154a 
  ql/src/java/org/apache/hadoop/hive/ql/plan/MsckDesc.java 
b7a7e4b7a5f8941b080c7805d224d3885885f444 
  ql/src/java/org/apache/hadoop/hive/ql/plan/PartitionDesc.java 
73981e826870139a42ad881103fdb0a2ef8433a2 

Diff: https://reviews.apache.org/r/56687/diff/


Testing
---

I've measured how much memory this change plus another one (interning 
Properties in PartitionDesc) save in my HS2 benchmark - the result is 37%. See 
the details in HIVE-15882.


Thanks,

Misha Dmitriev



Re: Review Request 56995: HIVE-15879 : Fix HiveMetaStoreChecker.checkPartitionDirs method

2017-02-27 Thread Sahil Takiar


> On Feb. 27, 2017, 6:01 p.m., Sahil Takiar wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/metadata/HiveMetaStoreChecker.java, 
> > line 452
> > 
> >
> > Can't the variable declaration be `Queue`, same for the other 
> > declarations below.
> 
> Vihang Karajgaonkar wrote:
> We can declare it as a Queue, but the implementation of the queue needs 
> to be thread-safe since multiple threads are going to operate on the queue at 
> the same time. I thought declaring it as a concurrentQueue would make it 
> clearer and more understandable, without any performance implications. Is there 
> a particular advantage you can think of to declaring it as a Queue? This also 
> makes sure that we have a compile-time type check to ensure that calling 
> methods are using concurrent queues.

In case we want to change the type of queue being used, e.g. a `BlockingQueue` 
could be used, or a custom concurrent queue.


> On Feb. 27, 2017, 6:01 p.m., Sahil Takiar wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/metadata/HiveMetaStoreChecker.java, 
> > lines 550-556
> > 
> >
> > Not sure I follow this logic, why are two queues necessary?
> 
> Vihang Karajgaonkar wrote:
> Consider you have just one queue called nextLevel. In such a scenario the 
> worker threads are removing elements and adding to the same queue. This is 
> the classic producer/consumer model operating on a single queue. Initially I 
> tried using a single blocking queue, but the terminating condition is 
> non-trivial to implement because the worker threads may or may not produce 
> more items for the queue. In such a case in order to avoid race conditions at 
> while(!nextLevel.isEmpty()) check we will either need some kind of marker 
> element added to indicate that worker thread is done, or use wait()/notify() 
> constructs to indicate when we have reached the terminating condition. That 
> implementation was getting too complex if we wanted to avoid all race 
> conditions possible. Given that this code path will be commonly used for msck 
> repair command, I thought of preferring the safer yet performant way at the 
> small cost of allocating a new queue for each level of the directory tree. 
> Hope that explains.

I see, yes that makes sense. While handling the terminating condition may be 
more difficult, won't it improve performance? Right now the multi-threaded 
listing only occurs on a single level of the directory tree at a time, and the 
code cannot move to the next level until the current level has been fully 
listed. Which means a few slow listStatus calls could slow down performance. 
How difficult is it to handle the terminating condition? How about using a 
semaphore to handle the wait()/notify() logic? It could be used to track the 
number of outstanding directory listings that need to be done, and every time a 
file or the max depth is hit the semaphore can be decremented.
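The counter-based alternative Sahil suggests can also be sketched. Here a `Phaser` stands in for the proposed semaphore: it tracks outstanding listings so the terminating condition is "all registered listings have arrived", and slow siblings no longer block the rest of the tree. As before, `Node` is an illustrative stand-in for a directory, not Hive's actual code:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Phaser;
import java.util.concurrent.atomic.AtomicInteger;

// Single-queue-style traversal: no per-level barrier; an outstanding-work
// counter (a Phaser) decides when the traversal is finished.
public class CounterTraversal {

    public static class Node {
        final List<Node> children;
        public Node(List<Node> children) { this.children = children; }
    }

    public static int countLeaves(Node root, ExecutorService pool) {
        AtomicInteger leaves = new AtomicInteger();
        Phaser outstanding = new Phaser(1);      // party 0: the calling thread
        submit(root, pool, outstanding, leaves);
        outstanding.arriveAndAwaitAdvance();     // blocks until every listing is done
        return leaves.get();
    }

    private static void submit(Node node, ExecutorService pool,
                               Phaser outstanding, AtomicInteger leaves) {
        outstanding.register();                  // one party per pending listing
        pool.execute(() -> {
            try {
                if (node.children.isEmpty()) {
                    leaves.incrementAndGet();    // a leaf directory
                } else {
                    for (Node child : node.children) {
                        // Children register *before* this task deregisters,
                        // so the phase cannot advance prematurely.
                        submit(child, pool, outstanding, leaves);
                    }
                }
            } finally {
                outstanding.arriveAndDeregister(); // this listing is done
            }
        });
    }
}
```

Tasks never block waiting on their children, so a fixed-size pool cannot deadlock; the subtlety the review worries about is confined to the register-before-deregister ordering.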


- Sahil


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56995/#review166898
---


On Feb. 27, 2017, 6:44 p.m., Vihang Karajgaonkar wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56995/
> ---
> 
> (Updated Feb. 27, 2017, 6:44 p.m.)
> 
> 
> Review request for hive, Aihua Xu, Ashutosh Chauhan, and Sergio Pena.
> 
> 
> Bugs: HIVE-15879
> https://issues.apache.org/jira/browse/HIVE-15879
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> HIVE-15879 : Fix HiveMetaStoreChecker.checkPartitionDirs method
> 
> 
> Diffs
> -
> 
>   ql/src/java/org/apache/hadoop/hive/ql/metadata/HiveMetaStoreChecker.java 
> 7c94c95f00492467ba27dedc9ce513e13c85ea61 
>   
> ql/src/test/org/apache/hadoop/hive/ql/metadata/TestHiveMetaStoreChecker.java 
> 35f52cd522e0e48a333e30966245bec65cc2ec9c 
> 
> Diff: https://reviews.apache.org/r/56995/diff/
> 
> 
> Testing
> ---
> 
> Tested using existing and newly added test cases
> 
> 
> Thanks,
> 
> Vihang Karajgaonkar
> 
>



Re: Review Request 56687: Intern strings in various critical places to reduce memory consumption.

2017-02-27 Thread Misha Dmitriev

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56687/#review166916
---




ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverMergeFiles.java 
(line 322)


Oh, now I see what you mean. Yes, you are right, the old code was actually 
better. I suspect that my change as you saw it was a byproduct of some other 
bigger change that I subsequently partially reverted. Anyway, I've just changed 
this code to what it was originally. Also, indeed, found one or two other 
similar changes and reverted them as well.


- Misha Dmitriev


On Feb. 24, 2017, 9:27 p.m., Misha Dmitriev wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56687/
> ---
> 
> (Updated Feb. 24, 2017, 9:27 p.m.)
> 
> 
> Review request for hive, Chaoyu Tang, Mohit Sabharwal, and Sergio Pena.
> 
> 
> Bugs: HIVE-15882
> 
> https://issues.apache.org/jira/browse/HIVE-15882
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> See the description of the problem in 
> https://issues.apache.org/jira/browse/HIVE-15882 Interning strings per this 
> review removes most of the overhead due to duplicate strings.
> 
> Also, where maps in several places are created from other maps, use the 
> original map's size for the new map. This is to avoid the situation when a 
> map with default capacity (typically 16) is created to hold just 2-3 entries, 
> and the rest of the internal 16-entry array is wasted.
> 
> 
> Diffs
> -
> 
>   common/src/java/org/apache/hadoop/hive/common/StringInternUtils.java 
> PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 
> e81cbce3e333d44a4088c10491f399e92a505293 
>   ql/src/java/org/apache/hadoop/hive/ql/hooks/Entity.java 
> 08420664d59f28f75872c25c9f8ee42577b23451 
>   ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 
> e91064b9c75e8adb2b36f21ff19ec0c1539b03b9 
>   ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java 
> 51530ac16c92cc75d501bfcb573557754ba0c964 
>   ql/src/java/org/apache/hadoop/hive/ql/io/SymbolicInputFormat.java 
> 55b3b551a1dac92583b6e03b10beb8172ca93d45 
>   ql/src/java/org/apache/hadoop/hive/ql/lockmgr/HiveLockObject.java 
> 82dc89803be9cf9e0018720eeceb90ff450bfdc8 
>   ql/src/java/org/apache/hadoop/hive/ql/metadata/Partition.java 
> c0edde9e92314d86482b5c46178987e79fae57fe 
>   ql/src/java/org/apache/hadoop/hive/ql/metadata/Table.java 
> c6ae6f290857cfd10f1023058ede99bf4a10f057 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 
> 24d16812515bdfa90b4be7a295c0388fcdfe95ef 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/GenMRSkewJoinProcessor.java
>  ede4fcbe342052ad86dadebcc49da2c0f515ea98 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/NullScanTaskDispatcher.java
>  0882ae2c6205b1636cbc92e76ef66bb70faadc76 
>   
> ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverMergeFiles.java 
> 68b0ad9ea63f051f16fec3652d8525f7ab07eb3f 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/MapWork.java 
> d4bdd96eaf8d179bed43b8a8c3be0d338940154a 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/MsckDesc.java 
> b7a7e4b7a5f8941b080c7805d224d3885885f444 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/PartitionDesc.java 
> 73981e826870139a42ad881103fdb0a2ef8433a2 
> 
> Diff: https://reviews.apache.org/r/56687/diff/
> 
> 
> Testing
> ---
> 
> I've measured how much memory this change plus another one (interning 
> Properties in PartitionDesc) save in my HS2 benchmark - the result is 37%. 
> See the details in HIVE-15882.
> 
> 
> Thanks,
> 
> Misha Dmitriev
> 
>



[jira] [Created] (HIVE-16050) Regression: Union of null with non-null

2017-02-27 Thread Gopal V (JIRA)
Gopal V created HIVE-16050:
--

 Summary: Regression: Union of null with non-null 
 Key: HIVE-16050
 URL: https://issues.apache.org/jira/browse/HIVE-16050
 Project: Hive
  Issue Type: Bug
  Components: Query Planning
Affects Versions: 2.2.0
Reporter: Gopal V
 Fix For: 2.2.0


{code}
hive> select null as c1 union select 1 as c2;
FAILED: SemanticException Schema of both sides of union should match: 
Column c1 is of type void on first table and type int on second table. Cannot 
tell the position of null AST.
{code}

This looks related to HIVE-14251 which removed certain cases from the UNION 
type coercion operations.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


Re: Review Request 56995: HIVE-15879 : Fix HiveMetaStoreChecker.checkPartitionDirs method

2017-02-27 Thread Vihang Karajgaonkar


> On Feb. 27, 2017, 6:01 p.m., Sahil Takiar wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/metadata/HiveMetaStoreChecker.java, 
> > line 452
> > 
> >
> > Can't the variable declaration be `Queue`, same for the other 
> > declarations below.

We can declare it as a Queue, but the implementation of the queue needs to be 
thread-safe since multiple threads are going to operate on the queue at the 
same time. I thought declaring it as a concurrentQueue would make it clearer 
and more understandable, without any performance implications. Is there a 
particular advantage you can think of to declaring it as a Queue? This also makes 
sure that we have a compile-time type check to ensure that calling methods are 
using concurrent queues.


> On Feb. 27, 2017, 6:01 p.m., Sahil Takiar wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/metadata/HiveMetaStoreChecker.java, 
> > line 576
> > 
> >
> > Usually better to `throw new HiveException(e)` so that the full 
> > stack-trace is included.

Agreed, fixed.


> On Feb. 27, 2017, 6:01 p.m., Sahil Takiar wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/metadata/HiveMetaStoreChecker.java, 
> > lines 550-556
> > 
> >
> > Not sure I follow this logic, why are two queues necessary?

Consider you have just one queue called nextLevel. In such a scenario the 
worker threads are removing elements and adding to the same queue. This is the 
classic producer/consumer model operating on a single queue. Initially I tried 
using a single blocking queue, but the terminating condition is non-trivial to 
implement because the worker threads may or may not produce more items for the 
queue. In such a case in order to avoid race conditions at 
while(!nextLevel.isEmpty()) check we will either need some kind of marker 
element added to indicate that worker thread is done, or use wait()/notify() 
constructs to indicate when we have reached the terminating condition. That 
implementation was getting too complex if we wanted to avoid all race 
conditions possible. Given that this code path will be commonly used for msck 
repair command, I thought of preferring the safer yet performant way at the 
small cost of allocating a new queue for each level of the directory tree. Hope 
that explains.


- Vihang


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56995/#review166898
---


On Feb. 27, 2017, 6:44 p.m., Vihang Karajgaonkar wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56995/
> ---
> 
> (Updated Feb. 27, 2017, 6:44 p.m.)
> 
> 
> Review request for hive, Aihua Xu, Ashutosh Chauhan, and Sergio Pena.
> 
> 
> Bugs: HIVE-15879
> https://issues.apache.org/jira/browse/HIVE-15879
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> HIVE-15879 : Fix HiveMetaStoreChecker.checkPartitionDirs method
> 
> 
> Diffs
> -
> 
>   ql/src/java/org/apache/hadoop/hive/ql/metadata/HiveMetaStoreChecker.java 
> 7c94c95f00492467ba27dedc9ce513e13c85ea61 
>   
> ql/src/test/org/apache/hadoop/hive/ql/metadata/TestHiveMetaStoreChecker.java 
> 35f52cd522e0e48a333e30966245bec65cc2ec9c 
> 
> Diff: https://reviews.apache.org/r/56995/diff/
> 
> 
> Testing
> ---
> 
> Tested using existing and newly added test cases
> 
> 
> Thanks,
> 
> Vihang Karajgaonkar
> 
>



Re: Review Request 56995: HIVE-15879 : Fix HiveMetaStoreChecker.checkPartitionDirs method

2017-02-27 Thread Vihang Karajgaonkar

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56995/
---

(Updated Feb. 27, 2017, 6:44 p.m.)


Review request for hive, Aihua Xu, Ashutosh Chauhan, and Sergio Pena.


Changes
---

Addressed Sahil's comments


Bugs: HIVE-15879
https://issues.apache.org/jira/browse/HIVE-15879


Repository: hive-git


Description
---

HIVE-15879 : Fix HiveMetaStoreChecker.checkPartitionDirs method


Diffs (updated)
-

  ql/src/java/org/apache/hadoop/hive/ql/metadata/HiveMetaStoreChecker.java 
7c94c95f00492467ba27dedc9ce513e13c85ea61 
  ql/src/test/org/apache/hadoop/hive/ql/metadata/TestHiveMetaStoreChecker.java 
35f52cd522e0e48a333e30966245bec65cc2ec9c 

Diff: https://reviews.apache.org/r/56995/diff/


Testing
---

Tested using existing and newly added test cases


Thanks,

Vihang Karajgaonkar



Re: Review Request 56995: HIVE-15879 : Fix HiveMetaStoreChecker.checkPartitionDirs method

2017-02-27 Thread Vihang Karajgaonkar

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56995/#review166900
---




ql/src/java/org/apache/hadoop/hive/ql/metadata/HiveMetaStoreChecker.java (line 
443)


We can declare it as a Queue, but the implementation of the queue needs to 
be thread-safe since multiple threads are going to operate on the queue at the 
same time. I thought declaring it as a concurrentQueue would make it clearer 
and more understandable, without any performance implications. Is there a 
particular advantage you can think of to declaring it as a Queue? This also makes 
sure that we have a compile-time type check to ensure that calling methods are 
using concurrent queues.



ql/src/java/org/apache/hadoop/hive/ql/metadata/HiveMetaStoreChecker.java (lines 
520 - 526)


Consider you have just one queue called nextLevel. In such a scenario the 
worker threads are removing elements and adding to the same queue. This is the 
classic producer/consumer model operating on a single queue. Initially I tried 
using a single blocking queue, but the terminating condition is non-trivial to 
implement because the worker threads may or may not produce more items for the 
queue. In such a case in order to avoid race conditions at 
while(!nextLevel.isEmpty()) check we will either need some kind of marker 
element added to indicate that worker thread is done, or use wait()/notify() 
constructs to indicate when we have reached the terminating condition. That 
implementation was getting too complex if we wanted to avoid all race 
conditions possible. Given that this code path will be commonly used for msck 
repair command, I thought of preferring the safer yet performant way at the 
small cost of allocating a new queue for each level of the directory tree. Hope 
that explains.



ql/src/java/org/apache/hadoop/hive/ql/metadata/HiveMetaStoreChecker.java (line 
546)


I just replicated what the previous code was throwing. Agreed, better to 
throw new HiveException(e). Changed.


- Vihang Karajgaonkar


On Feb. 24, 2017, 7:20 p.m., Vihang Karajgaonkar wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56995/
> ---
> 
> (Updated Feb. 24, 2017, 7:20 p.m.)
> 
> 
> Review request for hive, Aihua Xu, Ashutosh Chauhan, and Sergio Pena.
> 
> 
> Bugs: HIVE-15879
> https://issues.apache.org/jira/browse/HIVE-15879
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> HIVE-15879 : Fix HiveMetaStoreChecker.checkPartitionDirs method
> 
> 
> Diffs
> -
> 
>   ql/src/java/org/apache/hadoop/hive/ql/metadata/HiveMetaStoreChecker.java 
> 7c94c95f00492467ba27dedc9ce513e13c85ea61 
>   
> ql/src/test/org/apache/hadoop/hive/ql/metadata/TestHiveMetaStoreChecker.java 
> 35f52cd522e0e48a333e30966245bec65cc2ec9c 
> 
> Diff: https://reviews.apache.org/r/56995/diff/
> 
> 
> Testing
> ---
> 
> Tested using existing and newly added test cases
> 
> 
> Thanks,
> 
> Vihang Karajgaonkar
> 
>



Re: Review Request 56995: HIVE-15879 : Fix HiveMetaStoreChecker.checkPartitionDirs method

2017-02-27 Thread Sahil Takiar

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56995/#review166898
---




ql/src/java/org/apache/hadoop/hive/ql/metadata/HiveMetaStoreChecker.java (line 
443)


Can't the variable declaration be `Queue`, same for the other declarations 
below.



ql/src/java/org/apache/hadoop/hive/ql/metadata/HiveMetaStoreChecker.java (lines 
520 - 526)


Not sure I follow this logic, why are two queues necessary?



ql/src/java/org/apache/hadoop/hive/ql/metadata/HiveMetaStoreChecker.java (line 
546)


Usually better to `throw new HiveException(e)` so that the full stack-trace 
is included.


- Sahil Takiar


On Feb. 24, 2017, 7:20 p.m., Vihang Karajgaonkar wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56995/
> ---
> 
> (Updated Feb. 24, 2017, 7:20 p.m.)
> 
> 
> Review request for hive, Aihua Xu, Ashutosh Chauhan, and Sergio Pena.
> 
> 
> Bugs: HIVE-15879
> https://issues.apache.org/jira/browse/HIVE-15879
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> HIVE-15879 : Fix HiveMetaStoreChecker.checkPartitionDirs method
> 
> 
> Diffs
> -
> 
>   ql/src/java/org/apache/hadoop/hive/ql/metadata/HiveMetaStoreChecker.java 
> 7c94c95f00492467ba27dedc9ce513e13c85ea61 
>   
> ql/src/test/org/apache/hadoop/hive/ql/metadata/TestHiveMetaStoreChecker.java 
> 35f52cd522e0e48a333e30966245bec65cc2ec9c 
> 
> Diff: https://reviews.apache.org/r/56995/diff/
> 
> 
> Testing
> ---
> 
> Tested using existing and newly added test cases
> 
> 
> Thanks,
> 
> Vihang Karajgaonkar
> 
>



[jira] [Created] (HIVE-16049) upgrade to jetty 9

2017-02-27 Thread Sean Busbey (JIRA)
Sean Busbey created HIVE-16049:
--

 Summary: upgrade to jetty 9
 Key: HIVE-16049
 URL: https://issues.apache.org/jira/browse/HIVE-16049
 Project: Hive
  Issue Type: Improvement
Reporter: Sean Busbey
Assignee: Sean Busbey


Jetty 7 has been deprecated for a couple of years now. Hadoop and HBase have 
both updated to Jetty 9 for their next major releases, which will complicate 
classpath concerns.

Proactively update to Jetty 9 in the few places we use a web server.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


Re: Review Request 56995: HIVE-15879 : Fix HiveMetaStoreChecker.checkPartitionDirs method

2017-02-27 Thread Peter Vary

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56995/#review166889
---


Ship it!




Me, and Yetus are satisfied :)
Thanks for the patch!

LGTM (non binding)

- Peter Vary


On Feb. 24, 2017, 7:20 p.m., Vihang Karajgaonkar wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56995/
> ---
> 
> (Updated Feb. 24, 2017, 7:20 p.m.)
> 
> 
> Review request for hive, Aihua Xu, Ashutosh Chauhan, and Sergio Pena.
> 
> 
> Bugs: HIVE-15879
> https://issues.apache.org/jira/browse/HIVE-15879
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> HIVE-15879 : Fix HiveMetaStoreChecker.checkPartitionDirs method
> 
> 
> Diffs
> -
> 
>   ql/src/java/org/apache/hadoop/hive/ql/metadata/HiveMetaStoreChecker.java 
> 7c94c95f00492467ba27dedc9ce513e13c85ea61 
>   
> ql/src/test/org/apache/hadoop/hive/ql/metadata/TestHiveMetaStoreChecker.java 
> 35f52cd522e0e48a333e30966245bec65cc2ec9c 
> 
> Diff: https://reviews.apache.org/r/56995/diff/
> 
> 
> Testing
> ---
> 
> Tested using existing and newly added test cases
> 
> 
> Thanks,
> 
> Vihang Karajgaonkar
> 
>



Review Request 57094: HIVE-15996

2017-02-27 Thread Jesús Camacho Rodríguez

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/57094/
---

Review request for hive and Ashutosh Chauhan.


Bugs: HIVE-15996
https://issues.apache.org/jira/browse/HIVE-15996


Repository: hive-git


Description
---

HIVE-15996


Diffs
-

  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 
0872e535a9b6a09569c02fc498dab16867ca8783 
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFGrouping.java 
cc015268de0ab1e32401c1cf7c21502fb2a45331 
  ql/src/test/queries/clientpositive/groupby_grouping_sets_grouping.q 
78560978ccb84c0737decf45cb56810195be288d 
  ql/src/test/results/clientpositive/groupby_grouping_sets_grouping.q.out 
6917dbabb96b5a582416ac7f07968b739823a13f 

Diff: https://reviews.apache.org/r/57094/diff/


Testing
---


Thanks,

Jesús Camacho Rodríguez



[jira] [Created] (HIVE-16048) Hive UDF doesn't get the right evaluate method

2017-02-27 Thread Liao, Xiaoge (JIRA)
Liao, Xiaoge created HIVE-16048:
---

 Summary: Hive UDF doesn't get the right evaluate method
 Key: HIVE-16048
 URL: https://issues.apache.org/jira/browse/HIVE-16048
 Project: Hive
  Issue Type: Bug
  Components: UDF
Affects Versions: 1.1.1
Reporter: Liao, Xiaoge


Hive UDF doesn't pick the right evaluate method when one of the evaluate 
overloads takes a variable-arity (varargs) parameter.
For example:
{code}
public class TestUdf extends UDF {
    public String evaluate(String a, String b) throws ParseException {
        return a + ":" + b;
    }
    public String evaluate(String a, String... b) throws ParseException {
        return b[0] + ":" + a;
    }
}
{code}

The UDF may get the wrong result.
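Hive resolves `evaluate` methods by reflection, and the ambiguity reported above can be reproduced outside Hive: the varargs overload compiles to `evaluate(String, String[])`, so to a resolver that matches on argument count alone, both overloads are candidates for a two-`String` call. A hypothetical, simplified demonstration (no Hive classes involved):

```java
import java.lang.reflect.Method;

// Shows why the two overloads from the report are ambiguous to a naive
// reflection-based resolver: both are two-parameter methods whose first
// parameter is String.
public class VarargsAmbiguity {
    public String evaluate(String a, String b) { return a + ":" + b; }
    public String evaluate(String a, String... b) { return b[0] + ":" + a; }

    // Counts the evaluate() overloads that a count-based resolver would
    // consider for a call with two String arguments.
    public static int candidateCount() {
        int candidates = 0;
        for (Method m : VarargsAmbiguity.class.getMethods()) {
            if (!m.getName().equals("evaluate")) continue;
            Class<?>[] p = m.getParameterTypes();
            // Fixed-arity overload: (String, String).
            // Varargs overload: compiled as (String, String[]).
            if (p.length == 2 && p[0] == String.class) {
                candidates++;
            }
        }
        return candidates;
    }
}
```

Because `Class.getMethods()` makes no ordering guarantee, a resolver that simply takes the first match can return either overload, which matches the "may get the wrong result" behavior described.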



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


Re: Review Request 56687: Intern strings in various critical places to reduce memory consumption.

2017-02-27 Thread Rui Li


> On Feb. 24, 2017, 7:38 a.m., Rui Li wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverMergeFiles.java,
> >  line 322
> > 
> >
> > will this cause the hash map to resize since the default load factor is 
> > 0.75? and several similar concerns below
> 
> Misha Dmitriev wrote:
> You are probably right, in that this constructor's parameter is the 
> initial capacity of this table (more or less the size of the internal array) 
> - not how many elements the table is expected to hold. However, if you check 
> the code of HashMap, the things are more interesting. The actual capacity of 
> the table is always a power of two, so unless this parameter is also a power 
> of two, the capacity will be chosen as the nearest higher power of two, i.e. 
> it will be higher than the parameter and closer to what we actually need. 
> Also, if we create a table with the default size (16) here and then will put 
> many more elements into it, it will be resized several times, whereas with 
> the current code it will be resized at most once. Trying to "factor in" the 
> load factor will likely add more confusion/complexity. All in all, given that 
> choosing capacity in HashMap internally is non-trivial, I think it's 
> easier/safer to just call 'new HashMap(oldMap.size())' as we do now.

Then could you explain why we need to change the current code? The JavaDoc of 
LinkedHashMap(Map m) indicates it will create an 
instance "with a default load factor (0.75) and an initial capacity sufficient 
to hold the mappings in the specified map". Looking at the code, it computes 
the initial cap like "m.size()/loadFactor + 1", rounds it to next power of two, 
and it avoids re-hashing. Won't that be good enough for us?
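The capacity arithmetic in this exchange can be checked directly. The sketch below reproduces HashMap's rounding and resize threshold as of recent JDKs (it re-implements the math rather than peeking into HashMap internals, so treat it as an illustration, not a guarantee for every JDK version):

```java
// Reproduces java.util.HashMap's capacity math: the requested capacity is
// rounded up to a power of two, and the map resizes once the entry count
// exceeds capacity * loadFactor (0.75 by default).
public class MapSizing {
    // Same rounding HashMap.tableSizeFor uses in recent JDKs.
    public static int tableSizeFor(int cap) {
        int n = -1 >>> Integer.numberOfLeadingZeros(cap - 1);
        return (n < 0) ? 1 : (n >= (1 << 30)) ? (1 << 30) : n + 1;
    }

    // Will a map created as 'new HashMap<>(requested)' resize while
    // 'entries' entries are inserted?
    public static boolean resizes(int requested, int entries) {
        return entries > (int) (tableSizeFor(requested) * 0.75f);
    }
}
```

This shows both points at once: `new HashMap<>(oldMap.size())` with 16 entries rounds to capacity 16 (threshold 12) and resizes once, while the copy-constructor formula `size/0.75 + 1` rounds to capacity 32 and avoids rehashing, as Rui's JavaDoc quote says.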


- Rui


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56687/#review166649
---


On Feb. 24, 2017, 9:27 p.m., Misha Dmitriev wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56687/
> ---
> 
> (Updated Feb. 24, 2017, 9:27 p.m.)
> 
> 
> Review request for hive, Chaoyu Tang, Mohit Sabharwal, and Sergio Pena.
> 
> 
> Bugs: HIVE-15882
> 
> https://issues.apache.org/jira/browse/HIVE-15882
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> See the description of the problem in 
> https://issues.apache.org/jira/browse/HIVE-15882 Interning strings per this 
> review removes most of the overhead due to duplicate strings.
> 
> Also, where maps in several places are created from other maps, use the 
> original map's size for the new map. This is to avoid the situation when a 
> map with default capacity (typically 16) is created to hold just 2-3 entries, 
> and the rest of the internal 16-entry array is wasted.
> 
> 
> Diffs
> -
> 
>   common/src/java/org/apache/hadoop/hive/common/StringInternUtils.java 
> PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 
> e81cbce3e333d44a4088c10491f399e92a505293 
>   ql/src/java/org/apache/hadoop/hive/ql/hooks/Entity.java 
> 08420664d59f28f75872c25c9f8ee42577b23451 
>   ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 
> e91064b9c75e8adb2b36f21ff19ec0c1539b03b9 
>   ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java 
> 51530ac16c92cc75d501bfcb573557754ba0c964 
>   ql/src/java/org/apache/hadoop/hive/ql/io/SymbolicInputFormat.java 
> 55b3b551a1dac92583b6e03b10beb8172ca93d45 
>   ql/src/java/org/apache/hadoop/hive/ql/lockmgr/HiveLockObject.java 
> 82dc89803be9cf9e0018720eeceb90ff450bfdc8 
>   ql/src/java/org/apache/hadoop/hive/ql/metadata/Partition.java 
> c0edde9e92314d86482b5c46178987e79fae57fe 
>   ql/src/java/org/apache/hadoop/hive/ql/metadata/Table.java 
> c6ae6f290857cfd10f1023058ede99bf4a10f057 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 
> 24d16812515bdfa90b4be7a295c0388fcdfe95ef 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/GenMRSkewJoinProcessor.java
>  ede4fcbe342052ad86dadebcc49da2c0f515ea98 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/NullScanTaskDispatcher.java
>  0882ae2c6205b1636cbc92e76ef66bb70faadc76 
>   
> ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverMergeFiles.java 
> 68b0ad9ea63f051f16fec3652d8525f7ab07eb3f 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/MapWork.java 
> d4bdd96eaf8d179bed43b8a8c3be0d338940154a 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/MsckDesc.java 
> b7a7e4b7a5f8941b080c7805d224d3885885f444 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/PartitionDesc.java 
> 73981e826870139a42ad881103fdb0a2ef8433a2 

[jira] [Created] (HIVE-16047) Shouldn't try to get KeyProvider unless encryption is enabled

2017-02-27 Thread Rui Li (JIRA)
Rui Li created HIVE-16047:
-

 Summary: Shouldn't try to get KeyProvider unless encryption is 
enabled
 Key: HIVE-16047
 URL: https://issues.apache.org/jira/browse/HIVE-16047
 Project: Hive
  Issue Type: Bug
Reporter: Rui Li
Assignee: Rui Li
Priority: Minor


Found lots of following errors in HS2 log:
{noformat}
hdfs.KeyProviderCache: Could not find uri with key 
[dfs.encryption.key.provider.uri] to create a keyProvider !!
{noformat}

Similar to HDFS-7931



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (HIVE-16046) Broadcasting small table for Hive on Spark

2017-02-27 Thread liyunzhang_intel (JIRA)
liyunzhang_intel created HIVE-16046:
---

 Summary: Broadcasting small table for Hive on Spark
 Key: HIVE-16046
 URL: https://issues.apache.org/jira/browse/HIVE-16046
 Project: Hive
  Issue Type: Bug
Reporter: liyunzhang_intel


currently the spark plan is 
{code}
1. TS(Small table) -> Sel/Fil -> HashTableSink

2. TS(Small table) -> Sel/Fil -> HashTableSink

3. HashTableDummy ---+
                     |
   HashTableDummy ---+
                     |
   RootTS(Big table) -> Sel/Fil -> MapJoin -> Sel/Fil -> FileSink
{code}
1. Run the small-table SparkWorks on the Spark cluster, which dump their results 
to hashmap files.
2. Run the SparkWork for the big table on the Spark cluster. Mappers will look up 
the small-table hashmap from the file using HashTableDummy's loader.

The disadvantage of the current implementation is that it needs a long time to 
distribute-cache the hash table when the hash table is large. Here we want to use 
sparkContext.broadcast() to store the small table, although it will keep the 
broadcast variable in the driver and bring some performance decline on the driver.
[~Fred], [~xuefuz], [~lirui] and [~csun], please give some suggestions on it. 
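The effect `sparkContext.broadcast()` would have can be pictured as a broadcast (map-side) hash join: the small table is materialized as a hash map once and shared read-only by every mapper, instead of being written to and re-read from a distribute-cache file. A plain-Java simulation of the join step (illustrative names only, no Spark or Hive APIs):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Simulates the mapper side of a broadcast hash join: each big-table row
// probes the broadcast small-table map instead of shuffling.
public class BroadcastHashJoin {

    // bigRows: (key, value) pairs; smallTable: the broadcast key -> value map.
    public static List<String> join(List<String[]> bigRows,
                                    Map<String, String> smallTable) {
        return bigRows.stream()
            .filter(r -> smallTable.containsKey(r[0]))           // map-side lookup
            .map(r -> r[0] + "," + r[1] + "," + smallTable.get(r[0]))
            .collect(Collectors.toList());
    }
}
```

In the proposal, Spark's broadcast machinery would ship `smallTable` to each executor once per job, which is the saving over re-reading a hashmap file per task; the cost, as noted, is holding the broadcast variable in the driver.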



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)