[jira] [Created] (HIVE-16391) Publish proper Hive 1.2 jars (without including all dependencies in uber jar)

2017-04-05 Thread Reynold Xin (JIRA)
Reynold Xin created HIVE-16391:
--

 Summary: Publish proper Hive 1.2 jars (without including all 
dependencies in uber jar)
 Key: HIVE-16391
 URL: https://issues.apache.org/jira/browse/HIVE-16391
 Project: Hive
  Issue Type: Task
  Components: Build Infrastructure
Reporter: Reynold Xin


Apache Spark currently depends on a forked version of Apache Hive. AFAIK, the 
only change in the fork is to work around the issue that Hive publishes only 
two sets of jars: one set with no dependencies declared, and another with all 
the dependencies bundled into the published uber jar.

There is general consensus on both sides that we should remove the forked Hive.

The change in the forked version is recorded here 
https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2


Note that the fork previously included other fixes, but those have all become 
unnecessary.





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-9362) Document API Guarantees

2015-02-05 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308023#comment-14308023
 ] 

Reynold Xin commented on HIVE-9362:
---

It's great to see this ticket! It is an important step towards Hive being a 
platform and would be tremendously useful to Spark.

 Document API Guarantees
 --

 Key: HIVE-9362
 URL: https://issues.apache.org/jira/browse/HIVE-9362
 Project: Hive
  Issue Type: Task
Reporter: Brock Noland
Priority: Blocker
 Fix For: 0.15.0


 This is an uber JIRA to document our API compatibility guarantees. Similar to 
 Hadoop, I believe we should have 
 [InterfaceAudience|https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-annotations/src/main/java/org/apache/hadoop/classification/InterfaceAudience.java]
  and 
 [InterfaceStability|https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-annotations/src/main/java/org/apache/hadoop/classification/InterfaceStability.java]
  annotations, which I believe originally came from Sun.
 This project would be an effort by the Hive community, including other 
 projects that depend on Hive APIs, to document which APIs they use. 
 Although not all of the APIs they use may be considered {{Stable}} or even 
 {{Evolving}}, we'll at least have an idea of whom we are breaking when a 
 change is made.
 Beyond the Java API there is the Thrift API. Many projects use the Thrift 
 bindings directly since we don't provide an API in, say, Python. As such, I'd 
 suggest we consider the Thrift API to be {{Public}} and {{Stable}}.
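
As an illustration (not part of the ticket), here is a minimal sketch of how the 
proposed Hadoop-style annotations are typically applied; the class and method 
names below are hypothetical:

{code}
import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;

// Public + Evolving: external projects may rely on this class, but its
// signatures can still change between minor releases.
@InterfaceAudience.Public
@InterfaceStability.Evolving
public class ExampleHook {            // hypothetical API class
  public void run(String queryId) {
    // ...
  }
}
{code}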



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9410) ClassNotFoundException occurs during hive query case execution with UDF defined [Spark Branch]

2015-01-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated HIVE-9410:
--
Description: 
We have Hive query cases with UDFs defined (e.g. BigBench cases Q10, Q18, etc.). 
They pass in default Hive (on MR) mode but fail in Hive on Spark mode (both 
Standalone and Yarn-Client). 

Although we use 'add jar .jar;' to add the UDF jar explicitly, the issue 
still exists. 

BTW, if we put the UDF jar into the $HIVE_HOME/lib dir, the case passes.

The detailed error message is below (NOTE: 
de.bankmark.bigbench.queries.q10.SentimentUDF is the UDF contained in the jar 
bigbenchqueriesmr.jar, and we explicitly added a command like 'add jar 
/location/to/bigbenchqueriesmr.jar;' to the .sql file):

{code}
INFO  [pool-1-thread-1]: client.RemoteDriver (RemoteDriver.java:call(316)) - Failed to run job 8dd120cb-1a4d-4d1c-ba31-61eac648c27d
org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find class: de.bankmark.bigbench.queries.q10.SentimentUDF
Serialization trace:
genericUDTF (org.apache.hadoop.hive.ql.plan.UDTFDesc)
conf (org.apache.hadoop.hive.ql.exec.UDTFOperator)
childOperators (org.apache.hadoop.hive.ql.exec.SelectOperator)
childOperators (org.apache.hadoop.hive.ql.exec.MapJoinOperator)
childOperators (org.apache.hadoop.hive.ql.exec.FilterOperator)
childOperators (org.apache.hadoop.hive.ql.exec.TableScanOperator)
aliasToWork (org.apache.hadoop.hive.ql.plan.MapWork)
right (org.apache.commons.lang3.tuple.ImmutablePair)
edgeProperties (org.apache.hadoop.hive.ql.plan.SparkWork)
    at org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:138)
    at org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:115)
    at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:656)
    at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:99)
    at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
    at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
    at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
    at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
    at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776)
    at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:112)
    at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
    at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
    at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
    at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
    at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776)
    at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:112)
...
Caused by: java.lang.ClassNotFoundException: de.bankmark.bigbench.queries.q10.SentimentUDF
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:270)
    at org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:136)
    ... 55 more
{code}

[jira] [Commented] (HIVE-7333) Create RDD translator, translating Hive Tables into Spark RDDs [Spark Branch]

2014-11-12 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209315#comment-14209315
 ] 

Reynold Xin commented on HIVE-7333:
---

This is pretty trivial to solve. Each row in an RDD can be a batch of rows.
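
To make the idea concrete, here is a small, Spark-agnostic sketch (the class name 
is hypothetical): the same grouping logic could run inside a mapPartitions call 
so that each RDD element is a fixed-size batch of rows rather than a single row:

{code}
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Wraps an iterator of rows into an iterator of fixed-size batches.
public final class BatchingIterator<T> implements Iterator<List<T>> {
  private final Iterator<T> rows;
  private final int batchSize;

  public BatchingIterator(Iterator<T> rows, int batchSize) {
    this.rows = rows;
    this.batchSize = batchSize;
  }

  @Override
  public boolean hasNext() {
    return rows.hasNext();
  }

  @Override
  public List<T> next() {
    if (!rows.hasNext()) {
      throw new NoSuchElementException();
    }
    List<T> batch = new ArrayList<>(batchSize);
    while (rows.hasNext() && batch.size() < batchSize) {
      batch.add(rows.next());
    }
    return batch;   // one RDD element = one batch of rows
  }
}
{code}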


 Create RDD translator, translating Hive Tables into Spark RDDs [Spark Branch]
 -

 Key: HIVE-7333
 URL: https://issues.apache.org/jira/browse/HIVE-7333
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Xuefu Zhang
Assignee: Rui Li
  Labels: Spark-M1

 Please refer to the design specification.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-7333) Create RDD translator, translating Hive Tables into Spark RDDs [Spark Branch]

2014-11-12 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209327#comment-14209327
 ] 

Reynold Xin commented on HIVE-7333:
---

I don't think any changes are necessary in Spark. At the end of the day you can 
run arbitrary code on arbitrary records for each partition; using that alone 
should be sufficient to implement vectorization. 

You can even put an entire partition of records into one iterator output ...
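
A minimal sketch of that variant (again hypothetical, not Hive code): the 
per-partition function drains the whole partition into a single batch and emits 
it as one element:

{code}
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

public final class PartitionBatcher {
  // Collects an entire partition's rows into one batch, returned as a
  // single-element iterator (the shape mapPartitions-style APIs expect).
  public static <T> Iterator<List<T>> toSingleBatch(Iterator<T> partition) {
    List<T> batch = new ArrayList<>();
    while (partition.hasNext()) {
      batch.add(partition.next());
    }
    return Collections.singletonList(batch).iterator();
  }
}
{code}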


 Create RDD translator, translating Hive Tables into Spark RDDs [Spark Branch]
 -

 Key: HIVE-7333
 URL: https://issues.apache.org/jira/browse/HIVE-7333
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Xuefu Zhang
Assignee: Rui Li
  Labels: Spark-M1

 Please refer to the design specification.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-7334) Create SparkShuffler, shuffling data between map-side data processing and reduce-side processing

2014-07-29 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078939#comment-14078939
 ] 

Reynold Xin commented on HIVE-7334:
---

BTW definitely look at https://github.com/apache/spark/pull/1499

 Create SparkShuffler, shuffling data between map-side data processing and 
 reduce-side processing
 

 Key: HIVE-7334
 URL: https://issues.apache.org/jira/browse/HIVE-7334
 Project: Hive
  Issue Type: Sub-task
Reporter: Xuefu Zhang
Assignee: Rui Li
 Attachments: HIVE-7334.patch


 Please refer to the design spec.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HIVE-7387) Guava version conflict between hadoop and spark [Spark-Branch]

2014-07-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated HIVE-7387:
--

Description: 
hadoop-hdfs and hadoop-common depend on guava-11.0.2.jar, while Spark depends on 
guava-14.0.1.jar. guava-11.0.2 has an API conflict with guava-14.0.1, and since 
the Hive CLI currently loads both dependencies into the classpath, queries fail 
on both the spark engine and the mr engine.

{code}
java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
    at org.apache.spark.util.collection.OpenHashSet.org$apache$spark$util$collection$OpenHashSet$$hashcode(OpenHashSet.scala:261)
    at org.apache.spark.util.collection.OpenHashSet$mcI$sp.getPos$mcI$sp(OpenHashSet.scala:165)
    at org.apache.spark.util.collection.OpenHashSet$mcI$sp.contains$mcI$sp(OpenHashSet.scala:102)
    at org.apache.spark.util.SizeEstimator$$anonfun$visitArray$2.apply$mcVI$sp(SizeEstimator.scala:214)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
    at org.apache.spark.util.SizeEstimator$.visitArray(SizeEstimator.scala:210)
    at org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:169)
    at org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:161)
    at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:155)
    at org.apache.spark.storage.MemoryStore.putValues(MemoryStore.scala:75)
    at org.apache.spark.storage.MemoryStore.putValues(MemoryStore.scala:92)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:661)
    at org.apache.spark.storage.BlockManager.put(BlockManager.scala:546)
    at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:812)
    at org.apache.spark.broadcast.HttpBroadcast.<init>(HttpBroadcast.scala:52)
    at org.apache.spark.broadcast.HttpBroadcastFactory.newBroadcast(HttpBroadcastFactory.scala:35)
    at org.apache.spark.broadcast.HttpBroadcastFactory.newBroadcast(HttpBroadcastFactory.scala:29)
    at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
    at org.apache.spark.SparkContext.broadcast(SparkContext.scala:776)
    at org.apache.spark.rdd.HadoopRDD.<init>(HadoopRDD.scala:112)
    at org.apache.spark.SparkContext.hadoopRDD(SparkContext.scala:527)
    at org.apache.spark.api.java.JavaSparkContext.hadoopRDD(JavaSparkContext.scala:307)
    at org.apache.hadoop.hive.ql.exec.spark.SparkClient.createRDD(SparkClient.java:204)
    at org.apache.hadoop.hive.ql.exec.spark.SparkClient.execute(SparkClient.java:167)
    at org.apache.hadoop.hive.ql.exec.spark.SparkTask.execute(SparkTask.java:32)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:159)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:72)
{code}
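
For reference, a minimal sketch (not from the ticket) of the kind of call that 
blows up: Spark's code path invokes HashFunction.hashInt, which exists in 
guava-14.0.1 but not in guava-11.0.2, so the older jar winning on the classpath 
yields the NoSuchMethodError above:

{code}
import com.google.common.hash.HashCode;
import com.google.common.hash.Hashing;

public final class GuavaConflictDemo {
  public static void main(String[] args) {
    // Compiles against guava-14.0.1; throws NoSuchMethodError at runtime if
    // guava-11.0.2 is first on the classpath, where hashInt(int) is missing.
    HashCode code = Hashing.murmur3_32().hashInt(42);
    System.out.println(code);
  }
}
{code}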

NO PRECOMMIT TESTS. This is for spark branch only.

[jira] [Commented] (HIVE-3772) Fix a concurrency bug in LazyBinaryUtils due to a static field (patch by Reynold Xin)

2012-12-04 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510333#comment-13510333
 ] 

Reynold Xin commented on HIVE-3772:
---

Thanks for submitting this, Mikhail. Note that this was introduced in 0.9. In 
0.7, this was not a problem ...

 Fix a concurrency bug in LazyBinaryUtils due to a static field (patch by 
 Reynold Xin)
 -

 Key: HIVE-3772
 URL: https://issues.apache.org/jira/browse/HIVE-3772
 Project: Hive
  Issue Type: Bug
Reporter: Mikhail Bautin

 Creating a JIRA for [~rxin]'s patch needed by the Shark project. 
 https://github.com/amplab/hive/commit/17e1c3dd2f6d8eca767115dc46d5a880aed8c765
 writeVLong should not use a static field due to concurrency concerns.
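
To illustrate the concern (a hedged sketch, not Hive's actual implementation): a 
static scratch buffer shared by all threads gets corrupted under concurrent 
calls, while a per-thread (or local) buffer is safe:

{code}
import java.io.IOException;
import java.io.OutputStream;

public final class VLongWriterSketch {
  // Hazard: a single static buffer would be shared by every thread calling
  // writeVLong concurrently, so bytes from different calls get interleaved.
  // private static final byte[] SHARED_BUFFER = new byte[10];

  // Safer: give each thread its own scratch buffer.
  private static final ThreadLocal<byte[]> SCRATCH =
      ThreadLocal.withInitial(() -> new byte[10]);

  // Simplified varint-style encoding, for illustration only; Hive's real
  // LazyBinaryUtils.writeVLong uses a different wire format.
  public static void writeVLong(OutputStream out, long value) throws IOException {
    byte[] buf = SCRATCH.get();        // no cross-thread sharing
    int len = 0;
    do {
      long lowBits = value & 0x7F;
      value >>>= 7;
      buf[len++] = (byte) (value != 0 ? (lowBits | 0x80) : lowBits);
    } while (value != 0);
    out.write(buf, 0, len);
  }
}
{code}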

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira