Re: Why does SortShuffleWriter write to disk always?

2015-05-03 Thread Pramod Biligiri
Thanks for the info. I agree, it makes sense the way it is designed.

Pramod

On Sat, May 2, 2015 at 10:37 PM, Mridul Muralidharan mri...@gmail.com
wrote:

 I agree, this is better handled by the filesystem cache - not to
 mention, being able to do zero copy writes.

 Regards,
 Mridul

 On Sat, May 2, 2015 at 10:26 PM, Reynold Xin r...@databricks.com wrote:
  I've personally prototyped completely in-memory shuffle for Spark 3
 times.
  However, it is unclear how big of a gain it would be to put all of these
 in
  memory, under newer file systems (ext4, xfs). If the shuffle data is
 small,
  they are still in the file system buffer cache anyway. Note that network
  throughput is often lower than disk throughput, so it won't be a problem
 to
  read them from disk. And not having to keep all of this stuff in memory
  substantially simplifies memory management.
 
 
 
  On Fri, May 1, 2015 at 7:59 PM, Pramod Biligiri 
 pramodbilig...@gmail.com
  wrote:
 
  Hi,
  I was trying to see if I can make Spark avoid hitting the disk for small
  jobs, but I see that the SortShuffleWriter.write() always writes to
 disk. I
  found an older thread (
 
 
 http://apache-spark-user-list.1001560.n3.nabble.com/How-does-shuffle-work-in-spark-td584.html
  )
  saying that it doesn't call fsync on this write path.
 
  My question is why does it always write to disk?
  Does it mean the reduce phase reads the result from the disk as well?
  Isn't it possible to read the data from map/buffer in ExternalSorter
  directly during the reduce phase?
 
  Thanks,
  Pramod
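A side note, not from the thread: if the goal is simply to keep small shuffle files off a physical disk without changing Spark, one low-effort experiment is to point spark.local.dir at a tmpfs/ramdisk mount, so SortShuffleWriter's files live in memory via the filesystem. A minimal Scala sketch, assuming a hypothetical mount at /dev/shm/spark-tmp and standalone/local mode (some cluster managers override this setting):

    import org.apache.spark.{SparkConf, SparkContext}

    // /dev/shm/spark-tmp is a hypothetical tmpfs mount; adjust to your nodes.
    val conf = new SparkConf()
      .setAppName("small-shuffle-job")
      .set("spark.local.dir", "/dev/shm/spark-tmp")  // shuffle and spill files go here
    val sc = new SparkContext(conf)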
 



Re: [discuss] ending support for Java 6?

2015-05-03 Thread Sean Owen
Should be, but isn't what Jenkins does.
https://issues.apache.org/jira/browse/SPARK-1437

At this point it might be simpler to just decide that 1.5 will require
Java 7 and then the Jenkins setup is correct.

(NB: you can also solve this by setting bootclasspath to JDK 6 libs
even when using javac 7+ but I think this is overly complicated.)
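For reference (an illustration, not part of the original message): cross-compiling that way looks roughly like the following on the command line, with JDK6_HOME standing in for a local JDK 6 install:

    javac -source 1.6 -target 1.6 -bootclasspath $JDK6_HOME/jre/lib/rt.jar Foo.java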

On Sun, May 3, 2015 at 5:52 AM, Mridul Muralidharan mri...@gmail.com wrote:
 Hi Shane,

   Since we are still maintaining support for jdk6, jenkins should be
 using jdk6 [1] to ensure we do not inadvertently use jdk7 or higher
 api which breaks source level compat.
 -source and -target is insufficient to ensure api usage is conformant
 with the minimum jdk version we are supporting.

 Regards,
 Mridul

 [1] Not jdk7 as you mentioned

 On Sat, May 2, 2015 at 8:53 PM, shane knapp skn...@berkeley.edu wrote:
 that's kinda what we're doing right now, java 7 is the default/standard on
 our jenkins.

 or, i vote we buy a butler's outfit for thomas and have a second jenkins
 instance...  ;)




LDA and PageRank Using GraphX

2015-05-03 Thread Praveen Kumar Muthuswamy
Hi All,
I am looking to run the LDA topic modeling and PageRank algorithms that
come with GraphX
for some data analysis. Are there any examples (GraphX) that I can take
a look at?

Thanks
Praveen
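A minimal sketch of the GraphX PageRank path, assuming Spark 1.3+, an existing SparkContext named sc, and an edge-list file with one "srcId dstId" pair per line (the file path and tolerance are placeholders):

    import org.apache.spark.graphx.GraphLoader

    // Load the edge list and run PageRank until it converges to the given tolerance.
    val graph = GraphLoader.edgeListFile(sc, "data/graphx/followers.txt")
    val ranks = graph.pageRank(0.0001).vertices   // RDD of (vertexId, rank)
    ranks.sortBy(_._2, ascending = false).take(10).foreach(println)

For LDA, the implementation lives in MLlib rather than GraphX (org.apache.spark.mllib.clustering.LDA, added in Spark 1.3), and I believe the examples module also ships an LDAExample under org.apache.spark.examples.mllib.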


Re: Speeding up Spark build during development

2015-05-03 Thread Mark Hamstra
https://spark.apache.org/docs/latest/building-spark.html#building-with-buildmvn
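If only one module changed, Maven's -pl flag combined with the bundled build/mvn (which pulls in zinc) keeps the rebuild small; a sketch, with the artifact name depending on your Scala version:

    build/mvn -DskipTests -pl :spark-core_2.10 package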

On Sun, May 3, 2015 at 2:54 PM, Pramod Biligiri pramodbilig...@gmail.com
wrote:

 This is great. I didn't know about the mvn script in the build directory.

 Pramod

 On Fri, May 1, 2015 at 9:51 AM, York, Brennon brennon.y...@capitalone.com
 
 wrote:

  Following what Ted said, if you leverage the `mvn` from within the
  `build/` directory of Spark you'll get zinc for free which should help
  speed up build times.
 
  On 5/1/15, 9:45 AM, Ted Yu yuzhih...@gmail.com wrote:
 
  Pramod:
  Please remember to run Zinc so that the build is faster.
  
  Cheers
  
  On Fri, May 1, 2015 at 9:36 AM, Ulanov, Alexander
  alexander.ula...@hp.com
  wrote:
  
   Hi Pramod,
  
   For cluster-like tests you might want to use the same code as in
 mllib's
   LocalClusterSparkContext. You can rebuild only the package that you
  change
   and then run this main class.
  
   Best regards, Alexander
  
   -Original Message-
   From: Pramod Biligiri [mailto:pramodbilig...@gmail.com]
   Sent: Friday, May 01, 2015 1:46 AM
   To: dev@spark.apache.org
   Subject: Speeding up Spark build during development
  
   Hi,
   I'm making some small changes to the Spark codebase and trying it out
  on a
   cluster. I was wondering if there's a faster way to build than running
  the
   package target each time.
   Currently I'm using: mvn -DskipTests  package
  
   All the nodes have the same filesystem mounted at the same mount
 point.
  
   Pramod
  
 
  
 
 
 



Re: Speeding up Spark build during development

2015-05-03 Thread Pramod Biligiri
This is great. I didn't know about the mvn script in the build directory.

Pramod

On Fri, May 1, 2015 at 9:51 AM, York, Brennon brennon.y...@capitalone.com
wrote:

 Following what Ted said, if you leverage the `mvn` from within the
 `build/` directory of Spark you'll get zinc for free which should help
 speed up build times.

 On 5/1/15, 9:45 AM, Ted Yu yuzhih...@gmail.com wrote:

 Pramod:
 Please remember to run Zinc so that the build is faster.
 
 Cheers
 
 On Fri, May 1, 2015 at 9:36 AM, Ulanov, Alexander
 alexander.ula...@hp.com
 wrote:
 
  Hi Pramod,
 
  For cluster-like tests you might want to use the same code as in mllib's
  LocalClusterSparkContext. You can rebuild only the package that you
 change
  and then run this main class.
 
  Best regards, Alexander
 
  -Original Message-
  From: Pramod Biligiri [mailto:pramodbilig...@gmail.com]
  Sent: Friday, May 01, 2015 1:46 AM
  To: dev@spark.apache.org
  Subject: Speeding up Spark build during development
 
  Hi,
  I'm making some small changes to the Spark codebase and trying it out
 on a
  cluster. I was wondering if there's a faster way to build than running
 the
  package target each time.
  Currently I'm using: mvn -DskipTests  package
 
  All the nodes have the same filesystem mounted at the same mount point.
 
  Pramod
 

 





Re: Submit & Kill Spark Application program programmatically from another application

2015-05-03 Thread Chester Chen
Sounds like you are in Yarn-Cluster mode.

I created a JIRA SPARK-3913
https://issues.apache.org/jira/browse/SPARK-3913 and PR
https://github.com/apache/spark/pull/2786

Is this what you are looking for?




Chester

On Sat, May 2, 2015 at 10:32 PM, Yijie Shen henry.yijies...@gmail.com
wrote:

 Hi,

 I’ve posted this problem in user@spark but found no reply, therefore moved
 to dev@spark; sorry for the duplication.

 I am wondering if it is possible to submit, monitor & kill spark
 applications from another service.

 I have written a service like this:

 1. Parse user commands.
 2. Translate them into understandable arguments to an already prepared
 Spark-SQL application.
 3. Submit the application along with arguments to the Spark cluster,
 using spark-submit from a ProcessBuilder.
 4. Run the generated application's driver in cluster mode.

 The above 4 steps have been finished, but I have difficulties with these two:

 1. Query the application's status, for example, the percentage of completion.
 2. Kill queries accordingly.

 What I find in the Spark standalone documentation suggests killing an application
 using:

 ./bin/spark-class org.apache.spark.deploy.Client kill <master url> <driver ID>

 And should find
 the driver ID through the standalone Master web UI at
 http://<master url>:8080.

 Are there any programmatic methods by which I could get the driver ID submitted
 by my `ProcessBuilder` and query the status of that query?

 Any Suggestions?

 —
 Best Regards!
 Yijie Shen
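A related option, not mentioned in the thread: Spark 1.4 is adding a small launcher library (org.apache.spark.launcher.SparkLauncher) that wraps the spark-submit invocation, so the ProcessBuilder step can be done programmatically. It does not by itself give you the driver ID, progress, or a kill switch; those still go through the cluster manager (e.g. the standalone Master UI). A rough sketch with hypothetical jar, class, and master values:

    import org.apache.spark.launcher.SparkLauncher

    // All paths, class names and URLs below are placeholders.
    val sparkProcess = new SparkLauncher()
      .setSparkHome("/opt/spark")
      .setAppResource("/path/to/my-sql-app.jar")
      .setMainClass("com.example.MySqlApp")
      .setMaster("spark://master:7077")
      .setDeployMode("cluster")
      .addAppArgs("--query", "select count(*) from t")
      .launch()                      // returns a plain java.lang.Process

    val exitCode = sparkProcess.waitFor()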


Re: createDataFrame allows column names as second param in Python not in Scala

2015-05-03 Thread Reynold Xin
We can't drop the existing createDataFrame one, since it breaks API
compatibility, and the existing one also automatically infers the column
name for case classes (in that case users most likely won't be declaring
names directly). If this is really a problem, we should just create a new
function (maybe more than one, since you could argue the one for Seq should
also have that ...).



On Sun, May 3, 2015 at 2:13 AM, Olivier Girardot 
o.girar...@lateral-thoughts.com wrote:

 I have the perfect counter example where some of the data scientists
  prototype in Python and the production materials are done in Scala.
 But I get your point, as a matter of fact I realised the toDF method took
 parameters a little while after posting this.
 However the toDF still needs you to go from a List to an RDD, or create a
 useless Dataframe and call toDF on it re-creating a complete data
 structure. I just feel that the createDataFrame(_: Seq) is not really
 useful as it is, because I think there are practically no circumstances
 where you'd want to create a DataFrame without column names.

  I'm not implying an n-th overloaded method should be created, but rather that
  the signature of the existing method be changed to take an optional Seq of column
  names.

 Regards,

 Olivier.

 Le dim. 3 mai 2015 à 07:44, Reynold Xin r...@databricks.com a écrit :

 Part of the reason is that it is really easy to just call toDF on Scala,
 and we already have a lot of createDataFrame functions.

 (You might find some of the cross-language differences confusing, but I'd
 argue most real users just stick to one language, and developers or
 trainers are the only ones that need to constantly switch between
 languages).

 On Sat, May 2, 2015 at 11:05 AM, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 Hi everyone,
 SQLContext.createDataFrame has different behaviour in Scala or Python :

  >>> l = [('Alice', 1)]
  >>> sqlContext.createDataFrame(l).collect()
  [Row(_1=u'Alice', _2=1)]
  >>> sqlContext.createDataFrame(l, ['name', 'age']).collect()
  [Row(name=u'Alice', age=1)]

 and in Scala :

  scala> val data = List(("Alice", 1), ("Wonderland", 0))
  scala> sqlContext.createDataFrame(data, List("name", "score"))
  <console>:28: error: overloaded method value createDataFrame with
  alternatives: ... cannot be applied to ...

  What do you think about also allowing a Seq of column names in Scala,
  for the sake of consistency?

 Regards,

 Olivier.
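For reference, a minimal sketch of the toDF route discussed above, assuming Spark 1.3+ with a SparkContext sc and a SQLContext sqlContext already in scope:

    import sqlContext.implicits._

    val data = List(("Alice", 1), ("Wonderland", 0))
    // Go through an RDD and name the columns directly on toDF.
    val df = sc.parallelize(data).toDF("name", "score")
    df.show()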





Re: Multi-Line JSON in SparkSQL

2015-05-03 Thread Reynold Xin
How does the pivotal format decide where to split the files? It seems to
me the challenge is to decide that, and off the top of my head the only way
to do this is to scan from the beginning and parse the json properly, which
makes it not possible with large files (doable for whole input with a lot
of small files though). If there is a better way, we should do it.


On Sun, May 3, 2015 at 1:04 PM, Olivier Girardot 
o.girar...@lateral-thoughts.com wrote:

 Hi everyone,
 Is there any way in Spark SQL to load multi-line JSON data efficiently? I
 think there was a reference in the mailing list to
 http://pivotal-field-engineering.github.io/pmr-common/ for its
 JSONInputFormat

 But it's rather inaccessible considering the dependency is not available in
 any public maven repo (If you know of one, I'd be glad to hear it).

 Is there any plan to address this or any public recommendation ?
 (considering the documentation clearly states that sqlContext.jsonFile will
 not work for multi-line json(s))

 Regards,

 Olivier.
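For the many-small-files case mentioned above, one workaround that needs no extra input format is to read each file whole and hand the strings to jsonRDD; a sketch, assuming Spark 1.3+, one (possibly multi-line) JSON document per file, and a placeholder path:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    // wholeTextFiles yields (path, content) pairs, so newlines inside a document are preserved.
    val jsonDocs = sc.wholeTextFiles("hdfs:///data/json-dir").map(_._2)
    val df = sqlContext.jsonRDD(jsonDocs)
    df.printSchema()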



Question about PageRank with Live Journal

2015-05-03 Thread yunming zhang
Hi,

I have a question about running PageRank with the LiveJournal data, as suggested
by the example at

org.apache.spark.examples.graphx.LiveJournalPageRank


I ran with the following options

bin/run-example org.apache.spark.examples.graphx.LiveJournalPageRank
data/graphx/soc-LiveJournal1.txt --numEPart=1


And from the Spark UI it seems that, for the stage

mapPartitions at GraphImpl.scala:235

the shuffle read size is steadily increasing, all the way to 2.1GB, on a
single-node machine. I think the shuffle read size should be decreasing as the
number of messages decreases. I tried with 4 partitions and it seems that
the shuffle read for the mapPartitions job is decreasing as the program
progresses, but I am not sure why it is actually increasing with one
partition.

And it really destroys the performance for a single partition, even though
the single partition uses much less time in the reduce phase than the 4-partition
configuration on a single node.


Thanks


Re: Multi-Line JSON in SparkSQL

2015-05-03 Thread Olivier Girardot
I'll try to study that and get back to you.
Regards,

Olivier.

Le lun. 4 mai 2015 à 04:05, Reynold Xin r...@databricks.com a écrit :

 How does the pivotal format decide where to split the files? It seems to
 me the challenge is to decide that, and off the top of my head the only way
 to do this is to scan from the beginning and parse the json properly, which
 makes it not possible with large files (doable for whole input with a lot
 of small files though). If there is a better way, we should do it.


 On Sun, May 3, 2015 at 1:04 PM, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 Hi everyone,
 Is there any way in Spark SQL to load multi-line JSON data efficiently? I
 think there was a reference in the mailing list to
 http://pivotal-field-engineering.github.io/pmr-common/ for its
 JSONInputFormat

 But it's rather inaccessible considering the dependency is not available
 in
 any public maven repo (If you know of one, I'd be glad to hear it).

 Is there any plan to address this or any public recommendation ?
 (considering the documentation clearly states that sqlContext.jsonFile
 will
 not work for multi-line json(s))

 Regards,

 Olivier.





Blockers for 1.4.0

2015-05-03 Thread Sean Owen
I'd like to preemptively post the current list of 35 Blockers for
release 1.4.0.
(There are 53 Critical too, and a total of 273 JIRAs targeted for
1.4.0. Clearly most of that isn't accurate, so would be good to
un-target most of that.)

As a matter of process and hygiene, it would be best to either decide
they're not Blockers at this point and reprioritize, or focus on
addressing them, as we're now in the run up to release. I suggest that
we shouldn't release with any Blockers outstanding, by definition.


SPARK-7298 [Web UI] Harmonize style of new UI visualizations (Patrick Wendell)
SPARK-7297 [Web UI] Make timeline more discoverable (Patrick Wendell)
SPARK-7284 [Documentation, Streaming] Update streaming documentation for Spark 1.4.0 release (Tathagata Das)
SPARK-7228 [SparkR] SparkR public API for 1.4 release (Shivaram Venkataraman)
SPARK-7158 [SQL] collect and take return different results
SPARK-7139 [Streaming] Allow received block metadata to be saved to WAL and recovered on driver failure (Tathagata Das)
SPARK-7111 [Streaming] Exposing of input data rates of non-receiver streams like Kafka Direct stream (Saisai Shao)
SPARK-6941 [SQL] Provide a better error message to explain that tables created from RDDs are immutable
SPARK-6923 [SQL] Spark SQL CLI does not read Data Source schema correctly
SPARK-6906 [SQL] Refactor Connection to Hive Metastore (Michael Armbrust)
SPARK-6831 [Documentation, PySpark, SparkR, SQL] Document how to use external data sources
SPARK-6824 [SparkR] Fill the docs for DataFrame API in SparkR
SPARK-6812 [SparkR] filter() on DataFrame does not work as expected
SPARK-6811 [SparkR] Building binary R packages for SparkR
SPARK-6806 [Documentation, SparkR] SparkR examples in programming guide (Davies Liu)
SPARK-6784 [SQL] Clean up all the inbound/outbound conversions for DateType (Yin Huai)
SPARK-6702 [Streaming, Web UI] Update the Streaming Tab in Spark UI to show more batch information (Tathagata Das)
SPARK-6654 [Streaming] Update Kinesis Streaming impls (both KCL-based and Direct) to use latest aws-java-sdk and kinesis-client-library
SPARK-5960 [Streaming] Allow AWS credentials to be passed to KinesisUtils.createStream() (Chris Fregly)
SPARK-5948 [SQL] Support writing to partitioned table for the Parquet data source
SPARK-5947 [SQL] First class partitioning support in data sources API
SPARK-5920 [Shuffle] Use a BufferedInputStream to read local shuffle data (Kay Ousterhout)
SPARK-5707 [SQL] Enabling spark.sql.codegen throws ClassNotFound exception
SPARK-5517 [SQL] Add input types for Java UDFs
SPARK-5463 [SQL] Fix Parquet filter push-down
SPARK-5456 [SQL] Decimal Type comparison issue
SPARK-5182 [SQL] Partitioning support for tables created by the data source API (Cheng Lian)
SPARK-5180 [SQL] Data source API improvement
SPARK-4867 [SQL] UDF clean up
SPARK-2973 [SQL] Use LocalRelation for all ExecutedCommands, avoid job for take/collect() (Cheng Lian)
SPARK-2883 [Input/Output, SQL] Spark Support for ORCFile format
SPARK-2873 [SQL] Support disk spilling in Spark SQL aggregation (Yin Huai)
SPARK-1517 [Build, Project Infra] Publish nightly snapshots of documentation, maven artifacts, and binary builds (Nicholas Chammas)
SPARK-1442 [SQL] Add Window function support



Re: [discuss] ending support for Java 6?

2015-05-03 Thread shane knapp
that bug predates my time at the amplab...  :)

anyways, just to restate: jenkins currently only builds w/java 7.  if you
folks need 6, i can make it happen, but it will be a (smallish) bit of work.

shane

On Sun, May 3, 2015 at 2:14 AM, Sean Owen so...@cloudera.com wrote:

 Should be, but isn't what Jenkins does.
 https://issues.apache.org/jira/browse/SPARK-1437

 At this point it might be simpler to just decide that 1.5 will require
 Java 7 and then the Jenkins setup is correct.

 (NB: you can also solve this by setting bootclasspath to JDK 6 libs
 even when using javac 7+ but I think this is overly complicated.)

 On Sun, May 3, 2015 at 5:52 AM, Mridul Muralidharan mri...@gmail.com
 wrote:
  Hi Shane,
 
Since we are still maintaining support for jdk6, jenkins should be
  using jdk6 [1] to ensure we do not inadvertently use jdk7 or higher
  api which breaks source level compat.
  -source and -target is insufficient to ensure api usage is conformant
  with the minimum jdk version we are supporting.
 
  Regards,
  Mridul
 
  [1] Not jdk7 as you mentioned
 
  On Sat, May 2, 2015 at 8:53 PM, shane knapp skn...@berkeley.edu wrote:
  that's kinda what we're doing right now, java 7 is the default/standard
 on
  our jenkins.
 
  or, i vote we buy a butler's outfit for thomas and have a second jenkins
  instance...  ;)



Re: createDataFrame allows column names as second param in Python not in Scala

2015-05-03 Thread Olivier Girardot
I have the perfect counter example where some of the data scientists
prototype in Python and the production materials are done in Scala.
But I get your point, as a matter of fact I realised the toDF method took
parameters a little while after posting this.
However the toDF still needs you to go from a List to an RDD, or create a
useless Dataframe and call toDF on it re-creating a complete data
structure. I just feel that the createDataFrame(_: Seq) is not really
useful as it is, because I think there are practically no circumstances
where you'd want to create a DataFrame without column names.

I'm not implying an n-th overloaded method should be created, but rather that
the signature of the existing method be changed to take an optional Seq of column
names.

Regards,

Olivier.

Le dim. 3 mai 2015 à 07:44, Reynold Xin r...@databricks.com a écrit :

 Part of the reason is that it is really easy to just call toDF on Scala,
 and we already have a lot of createDataFrame functions.

 (You might find some of the cross-language differences confusing, but I'd
 argue most real users just stick to one language, and developers or
 trainers are the only ones that need to constantly switch between
 languages).

 On Sat, May 2, 2015 at 11:05 AM, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 Hi everyone,
 SQLContext.createDataFrame has different behaviour in Scala or Python :

 >>> l = [('Alice', 1)]
 >>> sqlContext.createDataFrame(l).collect()
 [Row(_1=u'Alice', _2=1)]
 >>> sqlContext.createDataFrame(l, ['name', 'age']).collect()
 [Row(name=u'Alice', age=1)]

 and in Scala :

 scala> val data = List(("Alice", 1), ("Wonderland", 0))
 scala> sqlContext.createDataFrame(data, List("name", "score"))
 <console>:28: error: overloaded method value createDataFrame with
 alternatives: ... cannot be applied to ...

 What do you think about also allowing a Seq of column names in Scala,
 for the sake of consistency?

 Regards,

 Olivier.





Multi-Line JSON in SparkSQL

2015-05-03 Thread Olivier Girardot
Hi everyone,
Is there any way in Spark SQL to load multi-line JSON data efficiently? I
think there was a reference in the mailing list to
http://pivotal-field-engineering.github.io/pmr-common/ for its
JSONInputFormat

But it's rather inaccessible considering the dependency is not available in
any public maven repo (If you know of one, I'd be glad to hear it).

Is there any plan to address this or any public recommendation ?
(considering the documentation clearly states that sqlContext.jsonFile will
not work for multi-line json(s))

Regards,

Olivier.