Re: Troubleshooting JVM OOM during Spark Unit Tests
What does /tmp/jvm-21940/hs_error.log tell you? It might give hints as to which threads are allocating the extra off-heap memory.

On Fri, Nov 21, 2014 at 1:50 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

Howdy folks, I’m trying to understand why I’m getting “insufficient memory” errors when trying to run Spark unit tests within a CentOS Docker container. I’m building Spark and running the tests as follows:

# build
sbt/sbt -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Pkinesis-asl -Phive -Phive-thriftserver package assembly/assembly

# Scala unit tests
sbt/sbt -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Pkinesis-asl -Phive -Phive-thriftserver catalyst/test sql/test hive/test mllib/test

The build completes successfully. After humming along for many minutes, the unit tests fail with this:

OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00074a58, 30932992, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 30932992 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /tmp/jvm-21940/hs_error.log

Exception in thread "Thread-20" java.io.EOFException
  at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2598)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1318)
  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
  at org.scalatest.tools.Framework$ScalaTestRunner$Skeleton$1$React.react(Framework.scala:945)
  at org.scalatest.tools.Framework$ScalaTestRunner$Skeleton$1.run(Framework.scala:934)
  at java.lang.Thread.run(Thread.java:745)
Exception in thread "Thread-16" java.net.SocketException: Connection reset
  at java.net.SocketInputStream.read(SocketInputStream.java:196)
  at java.net.SocketInputStream.read(SocketInputStream.java:122)
  at java.net.SocketInputStream.read(SocketInputStream.java:210)
  at java.io.ObjectInputStream$PeekInputStream.peek(ObjectInputStream.java:2293)
  at java.io.ObjectInputStream$BlockDataInputStream.peek(ObjectInputStream.java:2586)
  at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2596)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1318)
  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
  at sbt.React.react(ForkTests.scala:114)
  at sbt.ForkTests$$anonfun$mainTestTask$1$Acceptor$2$.run(ForkTests.scala:74)
  at java.lang.Thread.run(Thread.java:745)

Here are some (I think) relevant environment variables I have set:

export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.71-2.5.3.1.el7_0.x86_64
export JAVA_OPTS="-Xms128m -Xmx1g -XX:MaxPermSize=128m"
export MAVEN_OPTS="-Xmx512m -XX:MaxPermSize=128m"

How do I narrow down why this is happening? I know that running this thing within a Docker container may be playing a role here, but before poking around with Docker configs I want to make an effort at getting the Java setup right within the container.
I’ve already tried giving the container 2GB of memory, so I don’t think at this point it’s a restriction on the container. Any pointers on how to narrow the problem down? Nick P.S. If you’re wondering why I’m trying to run unit tests within a Docker container, I’m exploring a different angle on SPARK-3431 https://issues.apache.org/jira/browse/SPARK-3431.
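One angle worth checking before blaming the container: sbt forks separate JVMs for the tests, so JAVA_OPTS alone may not bound total memory use — the sbt launcher JVM plus each forked test JVM all reserve heap, permgen, and per-thread native stacks inside the same 2 GB. A rough sketch of tighter settings (the exact values here are guesses for illustration, not tuned numbers):

```shell
# Illustrative only: cap each JVM so that sbt plus its forked test JVMs
# fit inside a 2 GB container. Values below are untuned guesses.
export JAVA_OPTS="-Xms256m -Xmx768m -XX:MaxPermSize=128m -Xss512k"
export SBT_OPTS="-Xmx512m -XX:MaxPermSize=128m"
# -Xss matters for native (off-heap) memory: every one of the many test
# threads reserves a full stack, and os::commit_memory failures like the
# one above are native allocations, not heap ones.
```

The os::commit_memory failure is a native allocation failing, so lowering -Xmx can paradoxically help by leaving more room for off-heap allocations.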
Why Executor Deserialize Time takes more than 300ms?
In our experimental cluster (1 driver, 5 workers), we tried the simplest example:

sc.parallelize(Range(0, 100), 2).count

In the event log, we found the executor takes too much time on deserialization, about 300~500 ms, while the execution time is only 1 ms. Our servers have 2.3 GHz CPUs with 24 cores each, and we have set the serializer to org.apache.spark.serializer.KryoSerializer. The question is: is it normal for the executor to take 300~500 ms on deserialization? If not, any clues for performance tuning?
Re: Why Executor Deserialize Time takes more than 300ms?
Hi Xuelin, this type of question is probably better asked on the spark-user mailing list, u...@spark.apache.org http://apache-spark-user-list.1001560.n3.nabble.com

Do you mean the very first set of tasks takes 300-500 ms to deserialize? That is most likely the time taken to ship the jars from the driver to the executors. You should only pay this cost once per SparkContext (assuming you are not adding more jars later on). You could try simply running the same task again, from the same SparkContext, and see whether it still takes that much time to deserialize the tasks. If you really want to eliminate that initial time to send the jars, you could ensure that the jars are already on the executors, so they don't need to be sent by Spark at all. (Of course, this makes it harder to deploy new code; you'd still need to update those jars *somehow* when you do.)

hope this helps, Imran
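To make the "run it twice" check concrete, a small timing helper is enough. This is a sketch in plain Scala with a dummy workload standing in for the Spark action; in spark-shell you would pass `sc.parallelize(Range(0, 100), 2).count()` to `timeIt` instead:

```scala
object TimeIt {
  // Run a block and return (result, elapsed milliseconds).
  def timeIt[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)
  }

  def main(args: Array[String]): Unit = {
    // Dummy workload standing in for sc.parallelize(Range(0, 100), 2).count()
    def job(): Long = (0 until 100).map(_.toLong).sum

    val (r1, t1) = timeIt(job()) // first run pays any one-time setup cost
    val (r2, t2) = timeIt(job()) // second run shows the steady-state cost
    println(s"run1=$r1 in ${t1}ms, run2=$r2 in ${t2}ms")
  }
}
```

If the second run's "Executor Deserialize Time" in the event log is still 300-500 ms, jar shipping from the driver is probably not the cause.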
Re: Why Executor Deserialize Time takes more than 300ms?
Thanks Imran. The problem is, *every time* I run the same task, the deserialization time is around 300~500 ms. I don't know if this is a normal case.

-- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Why-Executor-Deserialize-Time-takes-more-than-300ms-tp9487p9489.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
java.lang.OutOfMemoryError at simple local test
Dear all, unfortunately I've not received any response in the users forum, which is why I decided to publish this question here. We have encountered failed jobs with large amounts of data. For example, an application works perfectly with relatively small data, but when the data grows by a factor of 2 the application fails. A simple local test was prepared for this question at https://gist.github.com/copy-of-rezo/6a137e13a1e4f841e7eb It generates 2 sets of key-value pairs, joins them, selects distinct values and finally counts the data.

object Spill {
  def generate = {
    for {
      j <- 1 to 10
      i <- 1 to 200
    } yield (j, i)
  }

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName(getClass.getSimpleName)
    conf.set("spark.shuffle.spill", "true")
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)
    println(generate)
    val dataA = sc.parallelize(generate)
    val dataB = sc.parallelize(generate)
    val dst = dataA.join(dataB).distinct().count()
    println(dst)
  }
}

We compiled it locally and ran it 3 times with different memory settings:

1) --executor-memory 10M --driver-memory 10M --num-executors 1 --executor-cores 1
It fails with java.lang.OutOfMemoryError: GC overhead limit exceeded at ... org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137)

2) --executor-memory 20M --driver-memory 20M --num-executors 1 --executor-cores 1
It works OK.

3) --executor-memory 10M --driver-memory 10M --num-executors 1 --executor-cores 1, but with less data: i from 1 to 100 instead of 1 to 200. This halves the input data and reduces the joined data by a factor of 4:

def generate = {
  for {
    j <- 1 to 10
    i <- 1 to 100 // previous value was 200
  } yield (j, i)
}

This code works OK. We don't understand why 10M is not enough for such a simple operation on approximately 32000 bytes of ints (2 * 10 * 200 * 2 * 4). 10M of RAM works if we halve the data volume (2000 records of (int, int)). Why doesn't spilling to disk cover this case?
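One thing the raw byte arithmetic above misses is the cardinality of the join output. Both sides have only 10 distinct keys, each with 200 values, so the join materializes 200 × 200 rows per key before distinct() runs. A quick sanity check in plain Scala (no Spark needed):

```scala
object JoinSize {
  val keys = 10            // outer loop bound in `generate`
  val valuesPerKey = 200   // inner loop bound in `generate`

  val inputRecords = keys * valuesPerKey                 // per side
  val joinedRecords = keys * valuesPerKey * valuesPerKey // after join

  def main(args: Array[String]): Unit = {
    println(s"input per side: $inputRecords records")  // 2,000
    println(s"after join:     $joinedRecords records") // 400,000
    // Each joined record is a boxed (Int, (Int, Int)) tuple on the JVM,
    // so the real heap cost per record is far above the raw int payload
    // (object headers, pointers, boxing) -- plausibly dozens of bytes
    // each, i.e. tens of MB, before distinct() even starts.
  }
}
```

Dropping `i` from 200 to 100 cuts the join output by 4x (10 × 100 × 100 = 100,000 records), which matches the "joined data in 4 times" observation above; that blow-up, plus JVM object overhead, is consistent with 10M failing while 20M succeeds.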
Re: How spark and hive integrate in long term?
Hey Zhan, this is a great question. We are also seeking a stable API/protocol that works with multiple Hive versions (esp. 0.12+). SPARK-4114 https://issues.apache.org/jira/browse/SPARK-4114 was opened for this. I did some research into HCatalog recently, but I must confess that I'm not an expert on HCatalog; I actually spent only 1 day exploring it, so please don't hesitate to correct me if I'm wrong about the conclusions I made below.

First, although the HCatalog API is more pleasant to work with, it's unfortunately feature-incomplete. It provides only a subset of the most commonly used operations. For example, |HCatCreateTableDesc| maps only a subset of |CreateTableDesc|; properties like |storeAsSubDirectories|, |skewedColNames| and |skewedColValues| are missing. It's also impossible to alter table properties via the HCatalog API (Spark SQL uses this to implement the |ANALYZE| command). The |hcat| CLI tool provides all those features missing in the HCatalog API via the raw Metastore API, and is structurally similar to the old Hive CLI.

Second, the HCatalog API itself doesn't ensure compatibility; it's the Thrift protocol that matters. HCatalog is built directly on the raw Metastore API, and talks the same Metastore Thrift protocol. The problem we encountered in Spark SQL is that we usually deploy Spark SQL Hive support with an embedded (for testing) or local mode Metastore, and this makes us suffer from things like Metastore database schema changes. If the Hive Metastore Thrift protocol is guaranteed to be downward compatible, then hopefully we can resort to a remote mode Metastore and always depend on the most recent Hive APIs. I had a glance at the Thrift protocol version handling code in Hive, and it seems that downward compatibility is not an issue. However, I didn't find any official documents about Thrift protocol compatibility. That said, in the future, hopefully we can depend only on the most recent Hive dependencies and remove the Hive shim layer introduced in branch 1.2.
Users who use exactly the same version of Hive as Spark SQL can use either a remote or a local/embedded Metastore, while users who want to interact with existing legacy Hive clusters have to set up a remote Metastore and let the Thrift protocol handle compatibility.

— Cheng

On 11/22/14 6:51 AM, Zhan Zhang wrote: Spark and Hive integration is now a very nice feature, but I am wondering what the long-term roadmap is for Spark integration with Hive. Both of these projects are undergoing fast improvement and change. Currently, my understanding is that the Spark Hive SQL part relies on the Hive metastore and basic parser to operate, and the thrift-server intercepts Hive queries and replaces the execution engine with its own. With every release of Hive, a significant effort is needed on the Spark side to support it. For the metastore part, we may possibly replace it with HCatalog. But given the dependency of other parts on Hive, e.g., metastore, thriftserver, HCatalog may not be able to help much. Does anyone have any insight or ideas in mind? Thanks. Zhan Zhang
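For readers unfamiliar with the remote-vs-embedded distinction above: switching to a remote Metastore is done through the standard Hive `hive.metastore.uris` property in a `hive-site.xml` on the classpath. A minimal sketch — the host and port here are placeholders, not values from this thread:

```xml
<!-- hive-site.xml: placeholder host/port, adjust for your cluster -->
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore-host:9083</value>
  </property>
</configuration>
```

When this property is unset, Hive falls back to an embedded/local Metastore backed directly by the database, which is exactly the mode that exposes clients to Metastore schema changes.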
Re: How spark and hive integrate in long term?
Should emphasize that this is still a quick and rough conclusion; I will investigate this in more detail after the 1.2.0 release. Anyway, we'd really like to make Hive support in Spark SQL as smooth and clean as possible for both developers and end users.

On 11/22/14 11:05 PM, Cheng Lian wrote:
Re: Troubleshooting JVM OOM during Spark Unit Tests
Here’s that log file https://gist.github.com/nchammas/08d3a3a02486cf602ceb from a different run of the unit tests that also failed. I’m not sure what to look for. If it matters any, I also changed JAVA_OPTS as follows for this run:

export JAVA_OPTS="-Xms512m -Xmx1024m -XX:PermSize=64m -XX:MaxPermSize=128m -Xss512k"

Nick

On Sat Nov 22 2014 at 3:09:55 AM Reynold Xin r...@databricks.com wrote:
Re: How spark and hive integrate in long term?
Thanks Cheng for the insights. Regarding HCatalog, I did some initial investigation too and agree with you; as of now, it does not seem like a good solution. I will talk to the Hive people to see whether there is such a guarantee of downward compatibility for the Thrift protocol. By the way, I tried some basic functions using hive-0.13 connecting to a hive-0.14 metastore, and it looks like they are compatible. Thanks.

Zhan Zhang

On Nov 22, 2014, at 7:14 AM, Cheng Lian lian.cs@gmail.com wrote:
Re: sbt publish-local fails, missing spark-network-common
Can you update to the latest master and see if this issue still exists? On Nov 21, 2014 10:58 PM, pedrorodriguez ski.rodrig...@gmail.com wrote: Haven't found one yet, but I work in the AMPLab/at AMP Camp, so I will see if I can find someone who would know more about this (maybe Reynold, since he rolled out the networking improvements for the PB sort). Good to have confirmation that at least one other person is having problems with this, rather than it being something isolated. -pedro
Re: How spark and hive integrate in long term?
There are two distinct topics when it comes to Hive integration. Part of the 1.3 roadmap will likely be better defining the plan for Hive integration as Hive releases future versions.

1. Ability to interact with Hive metastores from different versions == i.e., if a user has a metastore, can Spark SQL read the data? This one we need to solve by asking Hive for a stable metastore Thrift API, or by adding sufficient features to the HCatalog API so we can use that.

2. Compatibility with HQL over time as Hive adds new features. == This relates to how often we update our internal library dependency on Hive and/or build support for new Hive features internally.

On Sat, Nov 22, 2014 at 10:01 AM, Zhan Zhang zzh...@hortonworks.com wrote:
Re: Apache infra github sync down
Hi All, unfortunately this went back down again. I've opened a new JIRA to track it: https://issues.apache.org/jira/browse/INFRA-8688 - Patrick

On Tue, Nov 18, 2014 at 10:24 PM, Patrick Wendell pwend...@gmail.com wrote: Hey All, the Apache-to-GitHub mirroring is not working right now and hasn't been working for more than 24 hours. This means that pull requests will not appear as closed even though they have been merged. It also causes diffs to display incorrectly in some cases. If you'd like to follow progress by Apache infra on this issue, you can watch this JIRA: https://issues.apache.org/jira/browse/INFRA-8654 - Patrick