RE: Unsupported Catalyst types in Parquet
Hi Alex, I'll create JIRA SPARK-4985 for date type support in Parquet, and SPARK-4987 for timestamp type support. For decimal type, I think we only support decimals that fit in a long. Thanks, Daoyuan -----Original Message----- From: Alessandro Baretta [mailto:alexbare...@gmail.com] Sent: Saturday, December 27, 2014 2:47 PM To: dev@spark.apache.org; Michael Armbrust Subject: Unsupported Catalyst types in Parquet Michael, I'm having trouble storing my SchemaRDDs in Parquet format with Spark SQL, due to my RDDs having DateType and DecimalType fields. What would it take to add Parquet support for these Catalyst types? Are there any other Catalyst types for which there is no Parquet support? Alex
Re: Spark 1.2.0 build error
It means a test failed, but you have not shown the test failure; it would have been logged earlier. You would also need to say how you ran the tests. The tests for 1.2.0 pass for me on several common permutations. On Dec 29, 2014 3:22 AM, Naveen Madhire vmadh...@umail.iu.edu wrote: Hi, I am following the below link for building Spark 1.2.0: https://spark.apache.org/docs/1.2.0/building-spark.html I am getting the below error during the Maven build. I am using the IntelliJ IDE. The build is failing in the scalatest plugin:
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM .. SUCCESS [3.355s]
[INFO] Spark Project Networking .. SUCCESS [4.017s]
[INFO] Spark Project Shuffle Streaming Service ... SUCCESS [2.914s]
[INFO] Spark Project Core FAILURE [9.678s]
[INFO] Spark Project Bagel ... SKIPPED
[INFO] Spark Project GraphX .. SKIPPED
[ERROR] Failed to execute goal org.scalatest:scalatest-maven-plugin:1.0:test (test) on project spark-core_2.10: There are test failures - [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.scalatest:scalatest-maven-plugin:1.0:test (test) on project spark-core_2.10: There are test failures
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:212)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
at org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:317)
at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:152)
at org.apache.maven.cli.MavenCli.execute(MavenCli.java:555)
at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:214)
Is there something I am missing during the build process? Please suggest. Using IntelliJ 13.1.6, Maven 3.1.1, Scala 2.10.4, Spark 1.2.0. Thanks, Naveen
How to become spark developer in jira?
Hi devs, I'd like to ask what the procedures/conditions are for being assigned a developer role on the Spark JIRA. My motivation is to be able to assign issues to myself. The only related resource I have found is the JIRA permission scheme [1]. regards Jakub [1] https://cwiki.apache.org/confluence/display/SPARK/Jira+Permissions+Scheme
RE: Unsupported Catalyst types in Parquet
Daoyuan, Thanks for creating the JIRAs. I need these features by... last week, so I'd be happy to take care of this myself, if only you or someone more experienced than me in the SparkSQL codebase could provide some guidance. Alex
Re: How to become spark developer in jira?
Please ask someone else to assign them for now, and just comment on them that you're working on them. Over time, if you contribute a bunch, we'll add you to that list. The problem is that in the past, people would assign issues to themselves and never actually work on them, making it confusing for others. Matei
Re: How to become spark developer in jira?
Hi Matei, that makes sense. Thanks a lot! Jakub
Re: Which committers care about Kafka?
Hey all, Some wrap-up thoughts on this thread. Let me first reiterate what Patrick said: Kafka is super, super important, as it forms the largest fraction of the Spark Streaming user base. So we really want to improve the Kafka + Spark Streaming integration. To this end, the things that need to be considered can be broadly classified into the following, to sort of facilitate the discussion.
1. Data rate control
2. Receiver failure semantics - partially achieving this gives at-least-once, completely achieving this gives exactly-once
3. Driver failure semantics - partially achieving this gives at-least-once, completely achieving this gives exactly-once
Here is a rundown of what is achieved by different implementations (based on what I think).
1. Prior to the WAL in Spark 1.2, the KafkaReceiver could handle 3, could handle 1 partially (some duplicate data), and could NOT handle 2 (all previously received data lost).
2. In Spark 1.2 with the WAL enabled, Saisai's ReliableKafkaReceiver can handle 3, and can almost completely handle 1 and 2 (except a few corner cases which prevent it from completely guaranteeing exactly-once).
3. I believe Dibyendu's solution (correct me if I am wrong) can handle 1 and 2 perfectly. And 3 can be partially solved with the WAL, or possibly completely solved by extending the solution further.
4. Cody's solution (again, correct me if I am wrong) does not use receivers at all (so it eliminates 2). It can handle 3 completely for simple operations like map and filter, but I am not sure it works completely for stateful ops like windows and updateStateByKey. Also, it does not handle 1.
The real challenge for Kafka is in achieving 3 completely for stateful operations while also handling 1 (i.e., use receivers, but still get driver failure guarantees). Solving this will give us our holy-grail solution, and this is what I want to achieve. On that note, Cody submitted a PR on his style of achieving exactly-once semantics - https://github.com/apache/spark/pull/3798 . I am reviewing it. Please follow the PR if you are interested. TD
On Wed, Dec 24, 2014 at 11:59 PM, Cody Koeninger c...@koeninger.org wrote: The conversation was mostly getting TD up to speed on this thread, since he had just gotten back from his trip and hadn't seen it. The jira has a summary of the requirements we discussed; I'm sure TD or Patrick can add to the ticket if I missed something.
On Dec 25, 2014 1:54 AM, Hari Shreedharan hshreedha...@cloudera.com wrote: In general such discussions happen on or are posted to the dev lists. Could you please post a summary? Thanks. Thanks, Hari
On Wed, Dec 24, 2014 at 11:46 PM, Cody Koeninger c...@koeninger.org wrote: After a long talk with Patrick and TD (thanks guys), I opened the following jira: https://issues.apache.org/jira/browse/SPARK-4964 The sample PR has an implementation for the batch and the dstream case, and a link to a project with example usage.
On Fri, Dec 19, 2014 at 4:36 PM, Koert Kuipers ko...@tresata.com wrote: yup, we at tresata do the idempotent store the same way. very simple approach.
On Fri, Dec 19, 2014 at 5:32 PM, Cody Koeninger c...@koeninger.org wrote: That KafkaRDD code is dead simple. Given a user-specified map
(topic1, partition0) -> (startingOffset, endingOffset)
(topic1, partition1) -> (startingOffset, endingOffset)
...
turn each one of those entries into a partition of an RDD, using the simple consumer. That's it. No recovery logic, no state, nothing - for any failure, bail on the RDD and let it retry.
Spark stays out of the business of being a distributed database. The client code does any transformation it wants, then stores the data and offsets. There are two ways of doing this, based either on idempotence or on a transactional data store.
For idempotent stores:
1. manipulate data
2. save data to the store
3. save ending offsets to the same store
If you fail between 2 and 3, the offsets haven't been stored; you start again at the same beginning offsets, do the same calculations in the same order, overwrite the same data, and all is good.
For transactional stores:
1. manipulate data
2. begin transaction
3. save data to the store
4. save offsets
5. commit transaction
If you fail before 5, the transaction rolls back. To make this less heavyweight, you can write the data outside the transaction and then update a pointer to the current data inside the transaction.
Again, Spark has nothing much to do with guaranteeing exactly-once. In fact, the current streaming API actively impedes my ability to do the above. I'm just suggesting providing an API that doesn't get in the way of exactly-once.
On Fri, Dec 19, 2014 at 3:57 PM, Hari Shreedharan hshreedha...@cloudera.com wrote: Can you explain your basic algorithm for the once-only delivery? It is quite a bit of very Kafka-specific code that would take more time to read than I can currently afford. If you can explain your
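As a sketch of the idempotent-store pattern Cody describes above (in Python; the store object with its get/put methods, and the fetch callable standing in for a Kafka simple-consumer read, are hypothetical placeholders, not Spark or Kafka APIs):

    def process_batch(store, topic, fetch, transform):
        # 0. Start from the last committed ending offsets (empty dict on first run).
        offsets = store.get("offsets:" + topic, {})
        # Read [startingOffset, endingOffset) per partition; note the new endings.
        data, ending_offsets = fetch(topic, offsets)
        results = transform(data)                      # 1. manipulate data
        store.put("data:" + topic, results)            # 2. save data, keyed so re-runs overwrite
        store.put("offsets:" + topic, ending_offsets)  # 3. save ending offsets last

    # A crash between steps 2 and 3 leaves the old offsets committed, so the next
    # run re-reads the same range, redoes the same calculations, and overwrites the
    # same keys - exactly-once in effect, with no Spark-side recovery state.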
Re: Which committers care about Kafka?
Can you give a little more clarification on exactly what is meant by "1. Data rate control"? If someone wants to clamp the maximum number of messages per RDD partition in my solution, it would be very straightforward to do so. Regarding the holy grail, I'm pretty certain you can't have end-to-end transactional semantics without the client code being in charge of offset state. That means the client code is also going to need to be in charge of setting up an initial state for updateStateByKey that makes sense; as long as they can do that, the job should be safe to restart from arbitrary failures.
Re: Unsupported Catalyst types in Parquet
I'd love to get both of these in. There is some trickiness that I talk about on the JIRA for timestamps, since the SQL timestamp class can support nanoseconds and I don't think Parquet has a type for this. Other systems (Impala) seem to use INT96. It would be great to ask on the Parquet mailing list what the plan is there, to make sure that whatever we do is going to be compatible long term. Michael
Re: Spark 1.2.0 build error
I am getting a "The command is too long" error. Is there anything that needs to be done? For the time being, I followed the sbt way of building Spark in IntelliJ.
RE: Build Spark 1.2.0-rc1 encounter exceptions when running HiveContext - Caused by: java.lang.ClassNotFoundException: com.esotericsoftware.shaded.org.objenesis.strategy.InstantiatorStrategy
Hi Patrick, I manually hardcoded the Hive version to 0.13.1a and it works. It turns out that, for some reason, 0.13.1 was being picked up instead of the 0.13.1a version from Maven. So my solution was: hardcode hive.version to 0.13.1a. In my case, since I am building against Hive 0.13 only, the pom.xml was hardcoded with that version string, and the final JAR is now working with hive-exec 0.13.1a embedded. Possible reason why it didn't work? I suspect our internal environment was picking up 0.13.1, since we use our own Maven repo as a proxy and cache. 0.13.1a did appear in our own repo, replicated from Maven Central, but during the build process Maven picked up 0.13.1 instead of 0.13.1a. Date: Wed, 10 Dec 2014 12:23:08 -0800 Subject: Re: Build Spark 1.2.0-rc1 encounter exceptions when running HiveContext - Caused by: java.lang.ClassNotFoundException: com.esotericsoftware.shaded.org.objenesis.strategy.InstantiatorStrategy From: pwend...@gmail.com To: alee...@hotmail.com CC: dev@spark.apache.org Hi Andrew, It looks like somehow you are including jars from the upstream Apache Hive 0.13 project on your classpath. For Spark 1.2's Hive 0.13 support, we had to modify Hive to use a different version of Kryo that was compatible with Spark's Kryo version. https://github.com/pwendell/hive/commit/5b582f242946312e353cfce92fc3f3fa472aedf3 I would look through the actual classpath and make sure you aren't including your own hive-exec jar somehow. - Patrick On Wed, Dec 10, 2014 at 9:48 AM, Andrew Lee alee...@hotmail.com wrote: Apologies for the format; somehow it got messed up and linefeeds were removed. Here's a reformatted version. Hi All, I tried to include the necessary libraries in SPARK_CLASSPATH in spark-env.sh, to pick up auxiliary JARs and datanucleus*.jars from Hive; however, when I run HiveContext, it gives me the following error: Caused by: java.lang.ClassNotFoundException: com.esotericsoftware.shaded.org.objenesis.strategy.InstantiatorStrategy I have checked the JARs with (jar tf); it looks like this class is already included (shaded) in the assembly JAR (spark-assembly-1.2.0-hadoop2.4.1.jar), which is configured on the system classpath already. I couldn't figure out what is going on with the shading of the esotericsoftware JARs here. Any help is appreciated. How to reproduce the problem? Run the following 3 statements in spark-shell (this is how I launched my spark-shell: cd /opt/spark; ./bin/spark-shell --master yarn --deploy-mode client --queue research --driver-memory 1024M):
import org.apache.spark.SparkContext
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.hql("CREATE TABLE IF NOT EXISTS spark_hive_test_table (key INT, value STRING)")
A reference of my environment: Apache Hadoop 2.4.1, Apache Hive 0.13.1, Apache Spark branch-1.2 (installed under /opt/spark/, config under /etc/spark/). Maven build command: mvn -U -X -Phadoop-2.4 -Pyarn -Phive -Phive-0.13.1 -Dhadoop.version=2.4.1 -Dyarn.version=2.4.1 -Dhive.version=0.13.1 -DskipTests install Source code commit label: eb4d457a870f7a281dc0267db72715cd00245e82 My spark-env.sh had the following contents when I executed spark-shell:
HADOOP_HOME=/opt/hadoop/
HIVE_HOME=/opt/hive/
HADOOP_CONF_DIR=/etc/hadoop/
YARN_CONF_DIR=/etc/hadoop/
HIVE_CONF_DIR=/etc/hive/
HADOOP_SNAPPY_JAR=$(find $HADOOP_HOME/share/hadoop/common/lib/ -type f -name snappy-java-*.jar)
HADOOP_LZO_JAR=$(find $HADOOP_HOME/share/hadoop/common/lib/ -type f -name hadoop-lzo-*.jar)
SPARK_YARN_DIST_FILES=/user/spark/libs/spark-assembly-1.2.0-hadoop2.4.1.jar
export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:$HADOOP_HOME/lib/native
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native
export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:$HADOOP_HOME/lib/native
export SPARK_CLASSPATH=$SPARK_CLASSPATH:$HADOOP_SNAPPY_JAR:$HADOOP_LZO_JAR:$HIVE_CONF_DIR:/opt/hive/lib/datanucleus-api-jdo-3.2.6.jar:/opt/hive/lib/datanucleus-core-3.2.10.jar:/opt/hive/lib/datanucleus-rdbms-3.2.9.jar
Here's what I see from my stack trace:
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
Hive history file=/home/hive/log/alti-test-01/hive_job_log_b5db9539-4736-44b3-a601-04fa77cb6730_1220828461.txt
java.lang.NoClassDefFoundError: com/esotericsoftware/shaded/org/objenesis/strategy/InstantiatorStrategy
at org.apache.hadoop.hive.ql.exec.Utilities.<clinit>(Utilities.java:925)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.validate(SemanticAnalyzer.java:9718)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.validate(SemanticAnalyzer.java:9712)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:434)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:322)
Adding third party jars to classpath used by pyspark
What is the recommended way to do this? We have some native database client libraries for which we are adding pyspark bindings. pyspark invokes spark-submit. Do we add our libraries to SPARK_SUBMIT_LIBRARY_PATH? This issue relates back to an error we have been seeing, "Py4jError: Trying to call a package" - the suspicion being that the third-party libraries may not be available on the JVM side.
Re: Unsupported Catalyst types in Parquet
Michael, Actually, Adrian Wang already created pull requests for these issues: https://github.com/apache/spark/pull/3820 https://github.com/apache/spark/pull/3822 What do you think? Alex
RE: Which committers care about Kafka?
Hi Cody, From my understanding, rate control is an optional configuration in Spark Streaming and is disabled by default, so users can reach maximum throughput without any configuration. The reason why rate control is so important in stream processing is that Spark Streaming and other streaming frameworks are easily prone to unexpected behavior and failure due to network bursts and other uncontrollable injection rates. Especially for Spark Streaming, a large amount of data to process will delay the processing time, which will further delay the ongoing jobs and finally lead to failure. Thanks Jerry
Re: Adding third party jars to classpath used by pyspark
Hi Stephen, it should be enough to include --jars /path/to/file.jar in the command-line call to either pyspark or spark-submit, as in
spark-submit --master local --jars /path/to/file.jar myfile.py
and you can check the bottom of the Web UI's "Environment" tab to make sure the jar gets on your classpath. Let me know if you still see errors related to this. — Jeremy - jeremyfreeman.net @thefreemanlab
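A quick way to confirm from the Python side that a jar passed via --jars is actually visible to the JVM is to touch one of its classes through py4j; a minimal sketch, where com.example.db.Client is a hypothetical class name standing in for whatever the third-party library provides:

    from pyspark import SparkContext

    sc = SparkContext()  # launched via: spark-submit --jars /path/to/file.jar check_jar.py

    # sc._jvm exposes the driver JVM through py4j. If the class cannot be found,
    # py4j treats the dotted name as a package, and calling it raises the familiar
    # "Py4JError: Trying to call a package".
    client_class = sc._jvm.com.example.db.Client  # hypothetical class from the jar
    client = client_class()                       # works only if the jar is on the JVM classpath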
A question about using insert into in rdd foreach in spark 1.2
Hi All, I have a problem when I try to use INSERT INTO in a loop. This is my code:
def main(args: Array[String]) {
  // This is an empty table, schema is (Int, String)
  sqlContext.parquetFile("Data\\Test\\Parquet\\Temp").registerTempTable("temp")
  // A non-empty table, schema is (Int, String)
  val testData = sqlContext.parquetFile("Data\\Test\\Parquet\\2")
  testData.foreach { x =>
    sqlContext.sql("INSERT INTO temp SELECT " + x(0) + ", '" + x(1) + "'")
  }
  sqlContext.sql("select * from temp").collect().map(x => (x(0), x(1))).foreach(println)
}
When I run the code above in local mode, it does not stop and there is no error log. The latest log output is as follows:
14/12/30 11:07:44 WARN ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
14/12/30 11:07:44 INFO InternalParquetRecordReader: RecordReader initialized will read a total of 200 records.
14/12/30 11:07:44 INFO InternalParquetRecordReader: at row 0. reading next block
14/12/30 11:07:44 INFO CodecPool: Got brand-new decompressor [.gz]
14/12/30 11:07:44 INFO InternalParquetRecordReader: block read in memory in 20 ms. row count = 200
14/12/30 11:07:45 INFO SparkContext: Starting job: runJob at ParquetTableOperations.scala:325
14/12/30 11:07:45 INFO DAGScheduler: Got job 1 (runJob at ParquetTableOperations.scala:325) with 1 output partitions (allowLocal=false)
14/12/30 11:07:45 INFO DAGScheduler: Final stage: Stage 1(runJob at ParquetTableOperations.scala:325)
14/12/30 11:07:45 INFO DAGScheduler: Parents of final stage: List()
14/12/30 11:07:45 INFO DAGScheduler: Missing parents: List()
14/12/30 11:07:45 INFO DAGScheduler: Submitting Stage 1 (MapPartitionsRDD[6] at mapPartitions at basicOperators.scala:43), which has no missing parents
14/12/30 11:07:45 INFO MemoryStore: ensureFreeSpace(53328) called with curMem=239241, maxMem=1013836677
14/12/30 11:07:45 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 52.1 KB, free 966.6 MB)
14/12/30 11:07:45 INFO MemoryStore: ensureFreeSpace(31730) called with curMem=292569, maxMem=1013836677
14/12/30 11:07:45 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 31.0 KB, free 966.6 MB)
14/12/30 11:07:45 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:52533 (size: 31.0 KB, free: 966.8 MB)
14/12/30 11:07:45 INFO BlockManagerMaster: Updated info of block broadcast_2_piece0
14/12/30 11:07:45 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:838
14/12/30 11:07:45 INFO DAGScheduler: Submitting 1 missing tasks from Stage 1 (MapPartitionsRDD[6] at mapPartitions at basicOperators.scala:43)
14/12/30 11:07:45 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
Can anyone give me a hand? Thanks, evil
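A likely cause of the hang above is that sqlContext.sql is called inside foreach, which runs on the executors; the SQLContext is only usable on the driver, so the nested INSERT jobs never complete. A minimal sketch of the driver-side alternative, shown here in pyspark with the same paths and table names (illustrative only, not a tested fix for the poster's exact setup):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext("local", "insert-into-sketch")
    sqlContext = SQLContext(sc)

    # Register both Parquet files as tables on the driver.
    sqlContext.parquetFile("Data\\Test\\Parquet\\Temp").registerTempTable("temp")
    sqlContext.parquetFile("Data\\Test\\Parquet\\2").registerTempTable("testData")

    # One driver-side INSERT ... SELECT instead of one INSERT per row inside
    # foreach; the per-row version calls sqlContext.sql on executors, which hangs.
    sqlContext.sql("INSERT INTO temp SELECT * FROM testData")

    for row in sqlContext.sql("SELECT * FROM temp").collect():
        print (row[0], row[1])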
Help, pyspark.sql.List flatMap results become tuple
Hi pyspark guys, I have a JSON file, and its structure is like below:
{"NAME":"George", "AGE":35, "ADD_ID":1212, "POSTAL_AREA":1, "TIME_ZONE_ID":1, "INTEREST":[{"INTEREST_NO":1, "INFO":"x"}, {"INTEREST_NO":2, "INFO":"y"}]}
{"NAME":"John", "AGE":45, "ADD_ID":1213, "POSTAL_AREA":1, "TIME_ZONE_ID":1, "INTEREST":[{"INTEREST_NO":2, "INFO":"x"}, {"INTEREST_NO":3, "INFO":"y"}]}
I'm using the Spark SQL API to manipulate the JSON data in the pyspark shell:
sqlContext = SQLContext(sc)
A400 = sqlContext.jsonFile('jason_file_path')
Row(ADD_ID=1212, AGE=35, INTEREST=[Row(INFO=u'x', INTEREST_NO=1), Row(INFO=u'y', INTEREST_NO=2)], NAME=u'George', POSTAL_AREA=1, TIME_ZONE_ID=1)
Row(ADD_ID=1213, AGE=45, INTEREST=[Row(INFO=u'x', INTEREST_NO=2), Row(INFO=u'y', INTEREST_NO=3)], NAME=u'John', POSTAL_AREA=1, TIME_ZONE_ID=1)
X = A400.flatMap(lambda i: i.INTEREST)
The flatMap results are like below: each element in the JSON array was flattened to a plain tuple, not the pyspark.sql.Row I expected. I can only access the flattened results by index, but they should be flattened to Rows (namedtuples) and support access by name.
(u'x', 1)
(u'y', 2)
(u'x', 2)
(u'y', 3)
My Spark version is 1.1.
Re: Help, pyspark.sql.List flatMap results become tuple
The namedtuples degenerate to plain tuples:
A400.map(lambda i: map(None, i.INTEREST))
===
[(u'x', 1), (u'y', 2)]
[(u'x', 2), (u'y', 3)]
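A workaround on 1.1, as a sketch: the field values survive the flatMap, so named access can be rebuilt on the Python side. The Interest namedtuple below is our own, and its field order (INFO, INTEREST_NO) is an assumption read off the alphabetical Row reprs shown earlier; A400 is the SchemaRDD from the shell session above.

    from collections import namedtuple

    # Our own namedtuple; field order must match the Row repr (alphabetical here).
    Interest = namedtuple("Interest", ["INFO", "INTEREST_NO"])

    X = A400.flatMap(lambda i: [Interest(*r) for r in i.INTEREST])
    print X.map(lambda r: r.INTEREST_NO).collect()  # named access again: [1, 2, 2, 3]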
Re: Which committers care about Kafka?
Assuming you're talking about spark.streaming.receiver.maxRate, I just updated my PR to configure rate limiting based on that setting. So hopefully that's issue 1 sorted. Regarding issue 3: as far as I can tell, with respect to the odd semantics of stateful or windowed operations in the face of failure, my solution is no worse than existing classes such as FileStream that use InputDStream directly rather than a receiver. Can we get some specific cases that are a concern? Regarding the WAL solutions TD mentioned, one disadvantage is that they rely on checkpointing, unlike my approach. As I noted in this thread and in the jira ticket, I need something that can recover even when a checkpoint is lost, and I've already seen multiple situations in production where a checkpoint cannot be recovered (e.g. because code needs to be updated).
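For reference, a minimal pyspark sketch of the spark.streaming.receiver.maxRate setting Cody mentions above (the 10000 records/sec value is an arbitrary example; the limit is unset by default, i.e. unlimited):

    from pyspark import SparkConf, SparkContext
    from pyspark.streaming import StreamingContext

    conf = (SparkConf()
            .setAppName("rate-limited-stream")
            # Cap each receiver at 10,000 records per second (example value).
            .set("spark.streaming.receiver.maxRate", "10000"))
    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, 10)  # 10-second batches; attach receivers as usual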
Problems concerning implementing machine learning algorithm from scratch based on Spark
Hi all, I am trying to use some machine learning algorithms that are not included in MLlib, like mixture models and LDA (Latent Dirichlet Allocation), and I am using pyspark and Spark SQL. My problem is: I have some scripts that implement these algorithms, but I am not sure which parts I should change to make them fit big data. - Even very simple calculations may take a long time if the data is too big, but constructing an RDD or SQLContext table also takes too much time. I am really not sure whether I should use map() and reduce() every time I need to do a calculation. - Also, there are some matrix/array-level calculations that cannot be implemented easily using only map() and reduce(), so functions from the NumPy package have to be used. I am not sure, when the data is too big and we simply use the NumPy functions, whether it will take too much time. I have found some scripts that are not from MLlib and were created by other developers (credits to Meethu Mathew from Flytxt; thanks for giving me insights! :)) Many thanks, and I look forward to feedback! Best, Danqing GMMSpark.py (7K) http://apache-spark-developers-list.1001551.n3.nabble.com/attachment/9964/0/GMMSpark.py
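One pattern worth trying for the NumPy question above, sketched on a toy dataset: do the vectorized NumPy work once per partition rather than per record, and reduce only the small partial results. This computes column sums and a row count (the kind of sufficient statistic a GMM E-/M-step aggregates); it is an illustration under those assumptions, not tuned code.

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext("local", "numpy-partition-sketch")
    data = sc.parallelize([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], 2)  # toy dataset

    def partial_stats(rows):
        m = np.array(list(rows))           # vectorize the whole partition at once
        yield (m.sum(axis=0), m.shape[0])  # (column sums, row count)

    sums, n = data.mapPartitions(partial_stats).reduce(
        lambda a, b: (a[0] + b[0], a[1] + b[1]))  # combine small partial results
    print sums / n  # column means, e.g. [ 3.  4.]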
RE: Unsupported Catalyst types in Parquet
By adding a flag in SQLContext, I have modified #3822 to include nanoseconds now. Since passing too many flags is ugly, I now need the whole SQLContext, so that we can put more flags there. Thanks, Daoyuan From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Tuesday, December 30, 2014 10:43 AM To: Alessandro Baretta Cc: Wang, Daoyuan; dev@spark.apache.org Subject: Re: Unsupported Catalyst types in Parquet Yeah, I saw those. The problem is that #3822 truncates timestamps that include nanoseconds.