RE: Unsupported Catalyst types in Parquet
Hi Alex, I'll create JIRA SPARK-4985 for date type support in Parquet, and SPARK-4987 for timestamp type support. For decimal type, I think we only support decimals that fit in a long. Thanks, Daoyuan -----Original Message----- From: Alessandro Baretta [mailto:alexbare...@gmail.com] Sent: Saturday, December 27, 2014 2:47 PM To: dev@spark.apache.org; Michael Armbrust Subject: Unsupported Catalyst types in Parquet Michael, I'm having trouble storing my SchemaRDDs in Parquet format with Spark SQL, due to my RDDs having DateType and DecimalType fields. What would it take to add Parquet support for these Catalyst types? Are there any other Catalyst types for which there is no Parquet support? Alex
Re: Spark 1.2.0 build error
It means a test failed, but you have not shown the test failure; it would have been logged earlier. You would also need to say how you ran the tests. The tests for 1.2.0 pass for me on several common permutations. On Dec 29, 2014 3:22 AM, Naveen Madhire vmadh...@umail.iu.edu wrote: Hi, I am following the below link for building Spark 1.2.0: https://spark.apache.org/docs/1.2.0/building-spark.html I am getting the below error during the Maven build. I am using the IntelliJ IDE. The build is failing in the scalatest plugin:
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM .. SUCCESS [3.355s]
[INFO] Spark Project Networking .. SUCCESS [4.017s]
[INFO] Spark Project Shuffle Streaming Service ... SUCCESS [2.914s]
[INFO] Spark Project Core FAILURE [9.678s]
[INFO] Spark Project Bagel ... SKIPPED
[INFO] Spark Project GraphX .. SKIPPED
[ERROR] Failed to execute goal org.scalatest:scalatest-maven-plugin:1.0:test (test) on project spark-core_2.10: There are test failures - [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.scalatest:scalatest-maven-plugin:1.0:test (test) on project spark-core_2.10: There are test failures
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:212)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
at org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:317)
at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:152)
at org.apache.maven.cli.MavenCli.execute(MavenCli.java:555)
at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:214)
Is there something I am missing during the build process? Please suggest. Using IntelliJ 13.1.6, Maven 3.1.1, Scala 2.10.4, Spark 1.2.0. Thanks, Naveen
How to become spark developer in jira?
Hi devs, I'd like to ask what the procedures/conditions are for being assigned a developer role on the Spark JIRA. My motivation is to be able to assign issues to myself. The only related resource I have found is the JIRA permission scheme [1]. regards Jakub [1] https://cwiki.apache.org/confluence/display/SPARK/Jira+Permissions+Scheme
RE: Unsupported Catalyst types in Parquet
Daoyuan, Thanks for creating the JIRAs. I need these features by... last week, so I'd be happy to take care of this myself, if only you or someone more experienced than me in the SparkSQL codebase could provide some guidance. Alex
Re: How to become spark developer in jira?
Please ask someone else to assign them for now, and just comment on them that you're working on them. Over time, if you contribute a bunch, we'll add you to that list. The problem is that in the past, people would assign issues to themselves and never actually work on them, making it confusing for others. Matei
Re: How to become spark developer in jira?
Hi Matei, that makes sense. Thanks a lot! Jakub
Re: Which committers care about Kafka?
Hey all, Some wrap-up thoughts on this thread. Let me first reiterate what Patrick said: Kafka is super, super important, as it forms the largest fraction of the Spark Streaming user base. So we really want to improve the Kafka + Spark Streaming integration. To this end, the things that need to be considered can be broadly classified into the following, to sort of facilitate the discussion.
1. Data rate control
2. Receiver failure semantics - partially achieving this gives at-least-once, completely achieving this gives exactly-once
3. Driver failure semantics - partially achieving this gives at-least-once, completely achieving this gives exactly-once
Here is a rundown of what is achieved by different implementations (based on what I think).
1. Prior to the WAL in Spark 1.2, the KafkaReceiver could handle 3, could handle 1 partially (some duplicate data), and could NOT handle 2 (all previously received data lost).
2. In Spark 1.2 with the WAL enabled, Saisai's ReliableKafkaReceiver can handle 3, and can almost completely handle 1 and 2 (except a few corner cases which prevent it from completely guaranteeing exactly-once).
3. I believe Dibyendu's solution (correct me if I am wrong) can handle 1 and 2 perfectly. And 3 can be partially solved with the WAL, or possibly completely solved by extending the solution further.
4. Cody's solution (again, correct me if I am wrong) does not use receivers at all (so it eliminates 2). It can handle 3 completely for simple operations like map and filter, but I am not sure it works completely for stateful ops like windows and updateStateByKey. Also, it does not handle 1.
The real challenge for Kafka is in achieving 3 completely for stateful operations while also handling 1 (i.e., use receivers, but still get driver failure guarantees). Solving this will give us our holy-grail solution, and this is what I want to achieve. On that note, Cody submitted a PR on his style of achieving exactly-once semantics - https://github.com/apache/spark/pull/3798 . I am reviewing it. Please follow the PR if you are interested. TD
On Wed, Dec 24, 2014 at 11:59 PM, Cody Koeninger c...@koeninger.org wrote: The conversation was mostly getting TD up to speed on this thread, since he had just gotten back from his trip and hadn't seen it. The jira has a summary of the requirements we discussed; I'm sure TD or Patrick can add to the ticket if I missed something.
On Dec 25, 2014 1:54 AM, Hari Shreedharan hshreedha...@cloudera.com wrote: In general such discussions happen on or are posted to the dev lists. Could you please post a summary? Thanks. Thanks, Hari
On Wed, Dec 24, 2014 at 11:46 PM, Cody Koeninger c...@koeninger.org wrote: After a long talk with Patrick and TD (thanks guys), I opened the following jira: https://issues.apache.org/jira/browse/SPARK-4964 The sample PR has an implementation for the batch and the dstream case, and a link to a project with example usage.
On Fri, Dec 19, 2014 at 4:36 PM, Koert Kuipers ko...@tresata.com wrote: yup, we at tresata do the idempotent store the same way. very simple approach.
On Fri, Dec 19, 2014 at 5:32 PM, Cody Koeninger c...@koeninger.org wrote: That KafkaRDD code is dead simple. Given a user-specified map
(topic1, partition0) -> (startingOffset, endingOffset)
(topic1, partition1) -> (startingOffset, endingOffset)
...
turn each one of those entries into a partition of an RDD, using the simple consumer. That's it. No recovery logic, no state, nothing - for any failure, bail on the RDD and let it retry.
Spark stays out of the business of being a distributed database. The client code does any transformation it wants, then stores the data and offsets. There are two ways of doing this, based either on idempotence or on a transactional data store.
For idempotent stores:
1. manipulate data
2. save data to the store
3. save ending offsets to the same store
If you fail between 2 and 3, the offsets haven't been stored; you start again at the same beginning offsets, do the same calculations in the same order, overwrite the same data, and all is good.
For transactional stores:
1. manipulate data
2. begin transaction
3. save data to the store
4. save offsets
5. commit transaction
If you fail before 5, the transaction rolls back. To make this less heavyweight, you can write the data outside the transaction and then update a pointer to the current data inside the transaction.
Again, Spark has nothing much to do with guaranteeing exactly-once. In fact, the current streaming API actively impedes my ability to do the above. I'm just suggesting providing an API that doesn't get in the way of exactly-once.
On Fri, Dec 19, 2014 at 3:57 PM, Hari Shreedharan hshreedha...@cloudera.com wrote: Can you explain your basic algorithm for the once-only delivery? It is quite a bit of very Kafka-specific code that would take more time to read than I can currently afford. If you can explain your
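As a sketch of the idempotent-store pattern Cody describes above (in Python; the store object with its get/put methods, and the fetch callable standing in for a Kafka simple-consumer read, are hypothetical placeholders, not Spark or Kafka APIs):

    def process_batch(store, topic, fetch, transform):
        # 0. Start from the last committed ending offsets (empty dict on first run).
        offsets = store.get("offsets:" + topic, {})
        # Read [startingOffset, endingOffset) per partition; note the new endings.
        data, ending_offsets = fetch(topic, offsets)
        results = transform(data)                      # 1. manipulate data
        store.put("data:" + topic, results)            # 2. save data, keyed so re-runs overwrite
        store.put("offsets:" + topic, ending_offsets)  # 3. save ending offsets last

    # A crash between steps 2 and 3 leaves the old offsets committed, so the next
    # run re-reads the same range, redoes the same calculations, and overwrites the
    # same keys - exactly-once in effect, with no Spark-side recovery state.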
Re: Which committers care about Kafka?
Can you give a little more clarification on exactly what is meant by "1. Data rate control"? If someone wants to clamp the maximum number of messages per RDD partition in my solution, it would be very straightforward to do so. Regarding the holy grail, I'm pretty certain you can't have end-to-end transactional semantics without the client code being in charge of offset state. That means the client code is also going to need to be in charge of setting up an initial state for updateStateByKey that makes sense; as long as they can do that, the job should be safe to restart from arbitrary failures.
Re: Unsupported Catalyst types in Parquet
I'd love to get both of these in. There is some trickiness that I talk about on the JIRA for timestamps, since the SQL timestamp class can support nanoseconds and I don't think Parquet has a type for this. Other systems (Impala) seem to use INT96. It would be great to ask on the Parquet mailing list what the plan is there, to make sure that whatever we do is going to be compatible long term. Michael
Re: Spark 1.2.0 build error
I am getting a "The command is too long" error. Is there anything that needs to be done? For the time being, I followed the sbt way of building Spark in IntelliJ.
RE: Build Spark 1.2.0-rc1 encounter exceptions when running HiveContext - Caused by: java.lang.ClassNotFoundException: com.esotericsoftware.shaded.org.objenesis.strategy.InstantiatorStrategy
Hi Patrick, I manually hardcoded the Hive version to 0.13.1a and it works. It turns out that, for some reason, 0.13.1 was being picked up instead of the 0.13.1a version from Maven. So my solution was: hardcode hive.version to 0.13.1a. In my case, since I am building against Hive 0.13 only, the pom.xml was hardcoded with that version string, and the final JAR is now working with hive-exec 0.13.1a embedded. Possible reason why it didn't work? I suspect our internal environment was picking up 0.13.1, since we use our own Maven repo as a proxy and cache. 0.13.1a did appear in our own repo, replicated from Maven Central, but during the build process Maven picked up 0.13.1 instead of 0.13.1a. Date: Wed, 10 Dec 2014 12:23:08 -0800 Subject: Re: Build Spark 1.2.0-rc1 encounter exceptions when running HiveContext - Caused by: java.lang.ClassNotFoundException: com.esotericsoftware.shaded.org.objenesis.strategy.InstantiatorStrategy From: pwend...@gmail.com To: alee...@hotmail.com CC: dev@spark.apache.org Hi Andrew, It looks like somehow you are including jars from the upstream Apache Hive 0.13 project on your classpath. For Spark 1.2's Hive 0.13 support, we had to modify Hive to use a different version of Kryo that was compatible with Spark's Kryo version. https://github.com/pwendell/hive/commit/5b582f242946312e353cfce92fc3f3fa472aedf3 I would look through the actual classpath and make sure you aren't including your own hive-exec jar somehow. - Patrick On Wed, Dec 10, 2014 at 9:48 AM, Andrew Lee alee...@hotmail.com wrote: Apologies for the format; somehow it got messed up and linefeeds were removed. Here's a reformatted version. Hi All, I tried to include the necessary libraries in SPARK_CLASSPATH in spark-env.sh, to pick up auxiliary JARs and datanucleus*.jars from Hive; however, when I run HiveContext, it gives me the following error: Caused by: java.lang.ClassNotFoundException: com.esotericsoftware.shaded.org.objenesis.strategy.InstantiatorStrategy I have checked the JARs with (jar tf); it looks like this class is already included (shaded) in the assembly JAR (spark-assembly-1.2.0-hadoop2.4.1.jar), which is configured on the system classpath already. I couldn't figure out what is going on with the shading of the esotericsoftware JARs here. Any help is appreciated. How to reproduce the problem? Run the following 3 statements in spark-shell (this is how I launched my spark-shell: cd /opt/spark; ./bin/spark-shell --master yarn --deploy-mode client --queue research --driver-memory 1024M):
import org.apache.spark.SparkContext
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.hql("CREATE TABLE IF NOT EXISTS spark_hive_test_table (key INT, value STRING)")
A reference of my environment: Apache Hadoop 2.4.1, Apache Hive 0.13.1, Apache Spark branch-1.2 (installed under /opt/spark/, config under /etc/spark/). Maven build command: mvn -U -X -Phadoop-2.4 -Pyarn -Phive -Phive-0.13.1 -Dhadoop.version=2.4.1 -Dyarn.version=2.4.1 -Dhive.version=0.13.1 -DskipTests install Source code commit label: eb4d457a870f7a281dc0267db72715cd00245e82 My spark-env.sh had the following contents when I executed spark-shell:
HADOOP_HOME=/opt/hadoop/
HIVE_HOME=/opt/hive/
HADOOP_CONF_DIR=/etc/hadoop/
YARN_CONF_DIR=/etc/hadoop/
HIVE_CONF_DIR=/etc/hive/
HADOOP_SNAPPY_JAR=$(find $HADOOP_HOME/share/hadoop/common/lib/ -type f -name snappy-java-*.jar)
HADOOP_LZO_JAR=$(find $HADOOP_HOME/share/hadoop/common/lib/ -type f -name hadoop-lzo-*.jar)
SPARK_YARN_DIST_FILES=/user/spark/libs/spark-assembly-1.2.0-hadoop2.4.1.jar
export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:$HADOOP_HOME/lib/native
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native
export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:$HADOOP_HOME/lib/native
export SPARK_CLASSPATH=$SPARK_CLASSPATH:$HADOOP_SNAPPY_JAR:$HADOOP_LZO_JAR:$HIVE_CONF_DIR:/opt/hive/lib/datanucleus-api-jdo-3.2.6.jar:/opt/hive/lib/datanucleus-core-3.2.10.jar:/opt/hive/lib/datanucleus-rdbms-3.2.9.jar
Here's what I see from my stack trace:
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
Hive history file=/home/hive/log/alti-test-01/hive_job_log_b5db9539-4736-44b3-a601-04fa77cb6730_1220828461.txt
java.lang.NoClassDefFoundError: com/esotericsoftware/shaded/org/objenesis/strategy/InstantiatorStrategy
at org.apache.hadoop.hive.ql.exec.Utilities.<clinit>(Utilities.java:925)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.validate(SemanticAnalyzer.java:9718)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.validate(SemanticAnalyzer.java:9712)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:434)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:322)
Adding third party jars to classpath used by pyspark
What is the recommended way to do this? We have some native database client libraries for which we are adding pyspark bindings. pyspark invokes spark-submit. Do we add our libraries to SPARK_SUBMIT_LIBRARY_PATH? This issue relates back to an error we have been seeing, "Py4jError: Trying to call a package" - the suspicion being that the third-party libraries may not be available on the JVM side.
Re: Unsupported Catalyst types in Parquet
Michael, Actually, Adrian Wang already created pull requests for these issues: https://github.com/apache/spark/pull/3820 https://github.com/apache/spark/pull/3822 What do you think? Alex
RE: Which committers care about Kafka?
Hi Cody, From my understanding, rate control is an optional configuration in Spark Streaming and is disabled by default, so users can reach maximum throughput without any configuration. The reason why rate control is so important in stream processing is that Spark Streaming and other streaming frameworks are easily prone to unexpected behavior and failure due to network bursts and other uncontrollable injection rates. Especially for Spark Streaming, a large amount of data to process will delay the processing time, which will further delay the ongoing jobs and finally lead to failure. Thanks Jerry
Re: Adding third party jars to classpath used by pyspark
Hi Stephen, it should be enough to include --jars /path/to/file.jar in the command-line call to either pyspark or spark-submit, as in
spark-submit --master local --jars /path/to/file.jar myfile.py
and you can check the bottom of the Web UI's "Environment" tab to make sure the jar gets on your classpath. Let me know if you still see errors related to this. — Jeremy - jeremyfreeman.net @thefreemanlab
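A quick way to confirm from the Python side that a jar passed via --jars is actually visible to the JVM is to touch one of its classes through py4j; a minimal sketch, where com.example.db.Client is a hypothetical class name standing in for whatever the third-party library provides:

    from pyspark import SparkContext

    sc = SparkContext()  # launched via: spark-submit --jars /path/to/file.jar check_jar.py

    # sc._jvm exposes the driver JVM through py4j. If the class cannot be found,
    # py4j treats the dotted name as a package, and calling it raises the familiar
    # "Py4JError: Trying to call a package".
    client_class = sc._jvm.com.example.db.Client  # hypothetical class from the jar
    client = client_class()                       # works only if the jar is on the JVM classpath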
A question about using insert into in rdd foreach in spark 1.2
Hi All, I have a problem when I try to use INSERT INTO in a loop. This is my code:
def main(args: Array[String]) {
  // This is an empty table, schema is (Int, String)
  sqlContext.parquetFile("Data\\Test\\Parquet\\Temp").registerTempTable("temp")
  // A non-empty table, schema is (Int, String)
  val testData = sqlContext.parquetFile("Data\\Test\\Parquet\\2")
  testData.foreach { x =>
    sqlContext.sql("INSERT INTO temp SELECT " + x(0) + ", '" + x(1) + "'")
  }
  sqlContext.sql("select * from temp").collect().map(x => (x(0), x(1))).foreach(println)
}
When I run the code above in local mode, it does not stop and there is no error log. The latest log output is as follows:
14/12/30 11:07:44 WARN ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
14/12/30 11:07:44 INFO InternalParquetRecordReader: RecordReader initialized will read a total of 200 records.
14/12/30 11:07:44 INFO InternalParquetRecordReader: at row 0. reading next block
14/12/30 11:07:44 INFO CodecPool: Got brand-new decompressor [.gz]
14/12/30 11:07:44 INFO InternalParquetRecordReader: block read in memory in 20 ms. row count = 200
14/12/30 11:07:45 INFO SparkContext: Starting job: runJob at ParquetTableOperations.scala:325
14/12/30 11:07:45 INFO DAGScheduler: Got job 1 (runJob at ParquetTableOperations.scala:325) with 1 output partitions (allowLocal=false)
14/12/30 11:07:45 INFO DAGScheduler: Final stage: Stage 1(runJob at ParquetTableOperations.scala:325)
14/12/30 11:07:45 INFO DAGScheduler: Parents of final stage: List()
14/12/30 11:07:45 INFO DAGScheduler: Missing parents: List()
14/12/30 11:07:45 INFO DAGScheduler: Submitting Stage 1 (MapPartitionsRDD[6] at mapPartitions at basicOperators.scala:43), which has no missing parents
14/12/30 11:07:45 INFO MemoryStore: ensureFreeSpace(53328) called with curMem=239241, maxMem=1013836677
14/12/30 11:07:45 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 52.1 KB, free 966.6 MB)
14/12/30 11:07:45 INFO MemoryStore: ensureFreeSpace(31730) called with curMem=292569, maxMem=1013836677
14/12/30 11:07:45 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 31.0 KB, free 966.6 MB)
14/12/30 11:07:45 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:52533 (size: 31.0 KB, free: 966.8 MB)
14/12/30 11:07:45 INFO BlockManagerMaster: Updated info of block broadcast_2_piece0
14/12/30 11:07:45 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:838
14/12/30 11:07:45 INFO DAGScheduler: Submitting 1 missing tasks from Stage 1 (MapPartitionsRDD[6] at mapPartitions at basicOperators.scala:43)
14/12/30 11:07:45 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
Can anyone give me a hand? Thanks, evil
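A likely cause of the hang above is that sqlContext.sql is called inside foreach, which runs on the executors; the SQLContext is only usable on the driver, so the nested INSERT jobs never complete. A minimal sketch of the driver-side alternative, shown here in pyspark with the same paths and table names (illustrative only, not a tested fix for the poster's exact setup):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext("local", "insert-into-sketch")
    sqlContext = SQLContext(sc)

    # Register both Parquet files as tables on the driver.
    sqlContext.parquetFile("Data\\Test\\Parquet\\Temp").registerTempTable("temp")
    sqlContext.parquetFile("Data\\Test\\Parquet\\2").registerTempTable("testData")

    # One driver-side INSERT ... SELECT instead of one INSERT per row inside
    # foreach; the per-row version calls sqlContext.sql on executors, which hangs.
    sqlContext.sql("INSERT INTO temp SELECT * FROM testData")

    for row in sqlContext.sql("SELECT * FROM temp").collect():
        print (row[0], row[1])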
Help, pyspark.sql.List flatMap results become tuple
Hi pyspark guys, I have a JSON file, and its structure is like below:
{"NAME":"George", "AGE":35, "ADD_ID":1212, "POSTAL_AREA":1, "TIME_ZONE_ID":1, "INTEREST":[{"INTEREST_NO":1, "INFO":"x"}, {"INTEREST_NO":2, "INFO":"y"}]}
{"NAME":"John", "AGE":45, "ADD_ID":1213, "POSTAL_AREA":1, "TIME_ZONE_ID":1, "INTEREST":[{"INTEREST_NO":2, "INFO":"x"}, {"INTEREST_NO":3, "INFO":"y"}]}
I'm using the Spark SQL API to manipulate the JSON data in the pyspark shell:
sqlContext = SQLContext(sc)
A400 = sqlContext.jsonFile('jason_file_path')
Row(ADD_ID=1212, AGE=35, INTEREST=[Row(INFO=u'x', INTEREST_NO=1), Row(INFO=u'y', INTEREST_NO=2)], NAME=u'George', POSTAL_AREA=1, TIME_ZONE_ID=1)
Row(ADD_ID=1213, AGE=45, INTEREST=[Row(INFO=u'x', INTEREST_NO=2), Row(INFO=u'y', INTEREST_NO=3)], NAME=u'John', POSTAL_AREA=1, TIME_ZONE_ID=1)
X = A400.flatMap(lambda i: i.INTEREST)
The flatMap results are like below: each element in the JSON array was flattened to a plain tuple, not the pyspark.sql.Row I expected. I can only access the flattened results by index, but they should be flattened to Rows (namedtuples) and support access by name.
(u'x', 1)
(u'y', 2)
(u'x', 2)
(u'y', 3)
My Spark version is 1.1.
Re: Help, pyspark.sql.List flatMap results become tuple
The namedtuples degenerate to plain tuples:
A400.map(lambda i: map(None, i.INTEREST))
===
[(u'x', 1), (u'y', 2)]
[(u'x', 2), (u'y', 3)]
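A workaround on 1.1, as a sketch: the field values survive the flatMap, so named access can be rebuilt on the Python side. The Interest namedtuple below is our own, and its field order (INFO, INTEREST_NO) is an assumption read off the alphabetical Row reprs shown earlier; A400 is the SchemaRDD from the shell session above.

    from collections import namedtuple

    # Our own namedtuple; field order must match the Row repr (alphabetical here).
    Interest = namedtuple("Interest", ["INFO", "INTEREST_NO"])

    X = A400.flatMap(lambda i: [Interest(*r) for r in i.INTEREST])
    print X.map(lambda r: r.INTEREST_NO).collect()  # named access again: [1, 2, 2, 3]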
Re: Which committers care about Kafka?
Assuming you're talking about spark.streaming.receiver.maxRate, I just updated my PR to configure rate limiting based on that setting. So hopefully that's issue 1 sorted. Regarding issue 3: as far as I can tell, with respect to the odd semantics of stateful or windowed operations in the face of failure, my solution is no worse than existing classes such as FileStream that use InputDStream directly rather than a receiver. Can we get some specific cases that are a concern? Regarding the WAL solutions TD mentioned, one disadvantage is that they rely on checkpointing, unlike my approach. As I noted in this thread and in the jira ticket, I need something that can recover even when a checkpoint is lost, and I've already seen multiple situations in production where a checkpoint cannot be recovered (e.g. because code needs to be updated).
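For reference, a minimal pyspark sketch of the spark.streaming.receiver.maxRate setting Cody mentions above (the 10000 records/sec value is an arbitrary example; the limit is unset by default, i.e. unlimited):

    from pyspark import SparkConf, SparkContext
    from pyspark.streaming import StreamingContext

    conf = (SparkConf()
            .setAppName("rate-limited-stream")
            # Cap each receiver at 10,000 records per second (example value).
            .set("spark.streaming.receiver.maxRate", "10000"))
    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, 10)  # 10-second batches; attach receivers as usual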
Problems concerning implementing machine learning algorithm from scratch based on Spark
Hi all, I am trying to use some machine learning algorithms that are not included in MLlib, like mixture models and LDA (Latent Dirichlet Allocation), and I am using pyspark and Spark SQL. My problem is: I have some scripts that implement these algorithms, but I am not sure which parts I should change to make them fit big data. - Even very simple calculations may take a long time if the data is too big, but constructing an RDD or SQLContext table also takes too much time. I am really not sure whether I should use map() and reduce() every time I need to do a calculation. - Also, there are some matrix/array-level calculations that cannot be implemented easily using only map() and reduce(), so functions from the NumPy package have to be used. I am not sure, when the data is too big and we simply use the NumPy functions, whether it will take too much time. I have found some scripts that are not from MLlib and were created by other developers (credits to Meethu Mathew from Flytxt; thanks for giving me insights! :)) Many thanks, and I look forward to feedback! Best, Danqing GMMSpark.py (7K) http://apache-spark-developers-list.1001551.n3.nabble.com/attachment/9964/0/GMMSpark.py
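One pattern worth trying for the NumPy question above, sketched on a toy dataset: do the vectorized NumPy work once per partition rather than per record, and reduce only the small partial results. This computes column sums and a row count (the kind of sufficient statistic a GMM E-/M-step aggregates); it is an illustration under those assumptions, not tuned code.

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext("local", "numpy-partition-sketch")
    data = sc.parallelize([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], 2)  # toy dataset

    def partial_stats(rows):
        m = np.array(list(rows))           # vectorize the whole partition at once
        yield (m.sum(axis=0), m.shape[0])  # (column sums, row count)

    sums, n = data.mapPartitions(partial_stats).reduce(
        lambda a, b: (a[0] + b[0], a[1] + b[1]))  # combine small partial results
    print sums / n  # column means, e.g. [ 3.  4.]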
RE: Unsupported Catalyst types in Parquet
By adding a flag in SQLContext, I have modified #3822 to include nanoseconds now. Since passing too many flags is ugly, I now need the whole SQLContext, so that we can put more flags there. Thanks, Daoyuan From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Tuesday, December 30, 2014 10:43 AM To: Alessandro Baretta Cc: Wang, Daoyuan; dev@spark.apache.org Subject: Re: Unsupported Catalyst types in Parquet Yeah, I saw those. The problem is that #3822 truncates timestamps that include nanoseconds.