Re: Why does SortShuffleWriter write to disk always?
Thanks for the info. I agree, it makes sense the way it is designed. Pramod

On Sat, May 2, 2015 at 10:37 PM, Mridul Muralidharan mri...@gmail.com wrote:
I agree, this is better handled by the filesystem cache - not to mention being able to do zero-copy writes. Regards, Mridul

On Sat, May 2, 2015 at 10:26 PM, Reynold Xin r...@databricks.com wrote:
I've personally prototyped completely in-memory shuffle for Spark 3 times. However, it is unclear how big a gain it would be to put all of this in memory under newer file systems (ext4, xfs). If the shuffle data is small, it is still in the file system buffer cache anyway. Note that network throughput is often lower than disk throughput, so it won't be a problem to read it from disk. And not having to keep all of this stuff in memory substantially simplifies memory management.

On Fri, May 1, 2015 at 7:59 PM, Pramod Biligiri pramodbilig...@gmail.com wrote:
Hi, I was trying to see if I can make Spark avoid hitting the disk for small jobs, but I see that SortShuffleWriter.write() always writes to disk. I found an older thread (http://apache-spark-user-list.1001560.n3.nabble.com/How-does-shuffle-work-in-spark-td584.html) saying that it doesn't call fsync on this write path. My question is: why does it always write to disk? Does it mean the reduce phase reads the result from disk as well? Isn't it possible to read the data directly from the map buffer in ExternalSorter during the reduce phase? Thanks, Pramod
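For context, a minimal sketch of the behavior under discussion (the master setting, local-dir path, and job below are illustrative assumptions, not from the thread): any shuffle dependency goes through the sort-shuffle write path and lands on local disk, but for small jobs the files typically stay resident in the OS page cache, which is the point Reynold and Mridul make above.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ShuffleWriteDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("shuffle-write-demo")
      .setMaster("local[2]")                             // assumption: local run
      .set("spark.local.dir", "/tmp/spark-shuffle-demo") // assumption: demo path
    val sc = new SparkContext(conf)

    // reduceByKey introduces a shuffle; the map-side output is written by
    // SortShuffleWriter/ExternalSorter to files under spark.local.dir.
    val counts = sc.parallelize(1 to 100000, 4)
      .map(i => (i % 10, 1))
      .reduceByKey(_ + _)
    counts.count() // triggers the job and hence the shuffle write

    // shuffle_*.data / shuffle_*.index files now exist under the local dir;
    // per the thread, small files like these usually stay in the buffer cache.
    sc.stop()
  }
}
```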
Re: [discuss] ending support for Java 6?
Should be, but isn't what Jenkins does. https://issues.apache.org/jira/browse/SPARK-1437 At this point it might be simpler to just decide that 1.5 will require Java 7, and then the Jenkins setup is correct. (NB: you can also solve this by setting the bootclasspath to the JDK 6 libs even when using javac 7+, but I think this is overly complicated.)

On Sun, May 3, 2015 at 5:52 AM, Mridul Muralidharan mri...@gmail.com wrote:
Hi Shane, Since we are still maintaining support for JDK 6, Jenkins should be using JDK 6 [1] to ensure we do not inadvertently use a JDK 7 or higher API, which breaks source-level compatibility. -source and -target are insufficient to ensure API usage conforms to the minimum JDK version we are supporting. Regards, Mridul [1] Not JDK 7 as you mentioned

On Sat, May 2, 2015 at 8:53 PM, shane knapp skn...@berkeley.edu wrote:
that's kinda what we're doing right now, java 7 is the default/standard on our jenkins. or, i vote we buy a butler's outfit for thomas and have a second jenkins instance... ;)
LDA and PageRank Using GraphX
Hi All, I am looking to run LDA for topic modeling, along with the PageRank algorithm that comes with GraphX, for some data analysis. Are there any examples (GraphX) that I can take a look at? Thanks Praveen
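For reference, a minimal spark-shell sketch of the two APIs (assuming `sc` is the shell's SparkContext; the edge-list path, tolerance, and toy corpus are made up for illustration). Note that LDA lives in MLlib rather than GraphX, and the source tree also ships a runnable GraphX example at org.apache.spark.examples.graphx.LiveJournalPageRank.

```scala
import org.apache.spark.graphx.GraphLoader
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// PageRank with GraphX: load an edge list and run until the ranks converge.
val graph = GraphLoader.edgeListFile(sc, "data/graphx/followers.txt") // placeholder path
val ranks = graph.pageRank(tol = 0.0001).vertices
ranks.top(5)(Ordering.by(_._2)).foreach(println) // highest-ranked vertex IDs

// LDA with MLlib: documents are (docId, termCountVector) pairs; toy corpus only.
val corpus = sc.parallelize(Seq(
  (0L, Vectors.dense(1.0, 2.0, 0.0)),
  (1L, Vectors.dense(0.0, 1.0, 3.0))
))
val ldaModel = new LDA().setK(2).run(corpus)
ldaModel.describeTopics(maxTermsPerTopic = 3).foreach { case (terms, weights) =>
  println(terms.mkString(",") + " -> " + weights.mkString(","))
}
```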
Re: Speeding up Spark build during development
https://spark.apache.org/docs/latest/building-spark.html#building-with-buildmvn

On Sun, May 3, 2015 at 2:54 PM, Pramod Biligiri pramodbilig...@gmail.com wrote:
This is great. I didn't know about the mvn script in the build directory. Pramod

On Fri, May 1, 2015 at 9:51 AM, York, Brennon brennon.y...@capitalone.com wrote:
Following what Ted said, if you leverage the `mvn` from within the `build/` directory of Spark you'll get zinc for free, which should help speed up build times.

On 5/1/15, 9:45 AM, Ted Yu yuzhih...@gmail.com wrote:
Pramod: Please remember to run Zinc so that the build is faster. Cheers

On Fri, May 1, 2015 at 9:36 AM, Ulanov, Alexander alexander.ula...@hp.com wrote:
Hi Pramod, For cluster-like tests you might want to use the same code as in mllib's LocalClusterSparkContext. You can rebuild only the package that you change and then run this main class. Best regards, Alexander

-----Original Message-----
From: Pramod Biligiri [mailto:pramodbilig...@gmail.com]
Sent: Friday, May 01, 2015 1:46 AM
To: dev@spark.apache.org
Subject: Speeding up Spark build during development

Hi, I'm making some small changes to the Spark codebase and trying it out on a cluster. I was wondering if there's a faster way to build than running the package target each time. Currently I'm using: mvn -DskipTests package All the nodes have the same filesystem mounted at the same mount point. Pramod
Re: Submit Kill Spark Application program programmatically from another application
Sounds like you are in YARN cluster mode. I created JIRA SPARK-3913 (https://issues.apache.org/jira/browse/SPARK-3913) and PR https://github.com/apache/spark/pull/2786. Is this what you are looking for? Chester

On Sat, May 2, 2015 at 10:32 PM, Yijie Shen henry.yijies...@gmail.com wrote:
Hi, I posted this problem on user@spark but got no reply, so I've moved it to dev@spark; sorry for the duplication. I am wondering if it is possible to submit, monitor and kill Spark applications from another service. I have written a service like this:

- parse user commands
- translate them into understandable arguments for an already prepared Spark-SQL application
- submit the application along with its arguments to the Spark cluster, using spark-submit from a ProcessBuilder
- run the generated application's driver in cluster mode

The above 4 steps are finished, but I have difficulties with these two:

- query an application's status, for example its percentage completion
- kill queries accordingly

What I find in the Spark standalone documentation suggests killing an application using:

./bin/spark-class org.apache.spark.deploy.Client kill <master url> <driver ID>

and finding the driver ID through the standalone Master web UI at http://<master url>:8080. Are there any programmatic methods to get the driver ID of the application submitted by my ProcessBuilder, and to query its status? Any suggestions? Best Regards! Yijie Shen
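For illustration, a rough Scala sketch of the ProcessBuilder approach against a standalone cluster (the Spark home, master URL, jar, and main class are placeholders; it also assumes the driver ID can be scraped from the client's "Driver successfully submitted as driver-..." output line):

```scala
import scala.sys.process._

object DriverControl {
  val sparkHome = "/opt/spark"               // placeholder
  val master    = "spark://master-host:7077" // placeholder

  // Submit in cluster mode and scrape the driver ID from the client output.
  def submit(appJar: String, appArgs: Seq[String]): Option[String] = {
    val cmd = Seq(s"$sparkHome/bin/spark-submit",
      "--master", master,
      "--deploy-mode", "cluster",
      "--class", "com.example.PreparedSqlApp", // placeholder app class
      appJar) ++ appArgs
    val output = new StringBuilder
    cmd ! ProcessLogger(line => output.append(line).append('\n'))
    """driver-\d{14}-\d{4}""".r.findFirstIn(output.toString)
  }

  // Kill the driver the same way the standalone docs describe.
  def kill(driverId: String): Int =
    Seq(s"$sparkHome/bin/spark-class",
      "org.apache.spark.deploy.Client", "kill", master, driverId).!
}
```

For status, the standalone Master also serves a machine-readable snapshot of its state at http://<master url>:8080/json (worth verifying on your version), which can be polled for driver state; a percentage-completion figure is not exposed there and would have to come from the application itself.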
Re: createDataFrame allows column names as second param in Python not in Scala
We can't drop the existing createDataFrame one, since it breaks API compatibility, and the existing one also automatically infers the column names for case classes (in that case users most likely won't be declaring names directly). If this is really a problem, we should just create a new function (maybe more than one, since you could argue the one for Seq should also have that ...).

On Sun, May 3, 2015 at 2:13 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote:
I have the perfect counter-example, where some of the data scientists prototype in Python and the production materials are done in Scala. But I get your point; as a matter of fact, I realised the toDF method took parameters a little while after posting this. However, toDF still needs you to go from a List to an RDD, or to create a useless DataFrame and call toDF on it, re-creating a complete data structure. I just feel that createDataFrame(_: Seq) is not really useful as it is, because I think there are practically no circumstances where you'd want to create a DataFrame without column names. I'm not implying an n-th overloaded method should be created, but rather that we change the signature of the existing method to take an optional Seq of column names. Regards, Olivier.

On Sun, May 3, 2015 at 7:44 AM, Reynold Xin r...@databricks.com wrote:
Part of the reason is that it is really easy to just call toDF in Scala, and we already have a lot of createDataFrame functions. (You might find some of the cross-language differences confusing, but I'd argue most real users just stick to one language, and developers or trainers are the only ones that need to constantly switch between languages.)

On Sat, May 2, 2015 at 11:05 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote:
Hi everyone, SQLContext.createDataFrame has different behaviour in Scala and Python:

>>> l = [('Alice', 1)]
>>> sqlContext.createDataFrame(l).collect()
[Row(_1=u'Alice', _2=1)]
>>> sqlContext.createDataFrame(l, ['name', 'age']).collect()
[Row(name=u'Alice', age=1)]

and in Scala:

scala> val data = List(("Alice", 1), ("Wonderland", 0))
scala> sqlContext.createDataFrame(data, List("name", "score"))
<console>:28: error: overloaded method value createDataFrame with alternatives: ... cannot be applied to ...

What do you think about allowing a Seq of column names in Scala too, for the sake of consistency? Regards, Olivier.
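For reference, the toDF route mentioned above, as a spark-shell sketch (assumes Spark 1.3+ and the standard sqlContext implicits):

```scala
// Importing the SQLContext implicits equips a local Seq of Products (and RDDs)
// with toDF(columnNames*), which covers the "Seq plus column names" case.
import sqlContext.implicits._

val data = List(("Alice", 1), ("Wonderland", 0))
val df = data.toDF("name", "score") // column names supplied directly
df.show()
```

This is the shortest current path in Scala; Olivier's point that createDataFrame(_: Seq) itself has no name-taking overload still stands.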
Re: Multi-Line JSON in SparkSQL
How does the Pivotal format decide where to split the files? It seems to me the challenge is deciding that, and off the top of my head the only way to do this is to scan from the beginning and parse the JSON properly, which makes it not possible with large files (though it is doable when each input is a whole file and you have a lot of small files). If there is a better way, we should do it.

On Sun, May 3, 2015 at 1:04 PM, Olivier Girardot o.girar...@lateral-thoughts.com wrote:
Hi everyone, Is there any way in Spark SQL to load multi-line JSON data efficiently? I think there was a reference on the mailing list to http://pivotal-field-engineering.github.io/pmr-common/ for its JSONInputFormat, but it's rather inaccessible considering the dependency is not available in any public maven repo (if you know of one, I'd be glad to hear it). Is there any plan to address this, or any public recommendation? (The documentation clearly states that sqlContext.jsonFile will not work for multi-line JSON.) Regards, Olivier.
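One workaround consistent with the "whole file, lots of small files" case above (a sketch, not an official recommendation; the path is a placeholder): read each file intact with wholeTextFiles, so the multi-line JSON never has to be split, then hand the strings to jsonRDD.

```scala
// wholeTextFiles yields (path, fileContents) pairs, keeping each multi-line
// JSON document intact; jsonRDD then parses one document per element.
// Caveat: each file must fit comfortably in a single executor's memory.
val jsonStrings = sc.wholeTextFiles("hdfs:///data/json-docs/") // placeholder path
  .map { case (_, content) => content }

val df = sqlContext.jsonRDD(jsonStrings)
df.printSchema()
```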
Question about PageRank with Live Journal
Hi, I have a question about running PageRank with the LiveJournal data, as suggested by the example at org.apache.spark.examples.graphx.LiveJournalPageRank. I ran it with the following options:

bin/run-example org.apache.spark.examples.graphx.LiveJournalPageRank data/graphx/soc-LiveJournal1.txt --numEPart=1

From the Spark UI, it seems that the shuffle read size of mapPartitions at GraphImpl.scala:235 is steadily increasing, all the way to 2.1GB, on a single-node machine. I think the shuffle read size should be decreasing as the number of messages decreases? I tried with 4 partitions, and there the shuffle read for the mapPartitions job does decrease as the program progresses. But I am not sure why it is actually increasing for one partition? And it really destroys the performance for a single partition, even though a single partition spends much less time in the reduce phase than the 4-partition configuration on a single node. Thanks
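For anyone reproducing this from a shell, a sketch of an equivalent run with an explicit edge-partition count (the data path mirrors the command above; the partition strategy and tolerance are illustrative choices, not necessarily the example's defaults):

```scala
import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

// --numEPart in the example corresponds to numEdgePartitions here.
val graph = GraphLoader
  .edgeListFile(sc, "data/graphx/soc-LiveJournal1.txt", numEdgePartitions = 4)
  .partitionBy(PartitionStrategy.RandomVertexCut) // illustrative strategy

// Convergence-based PageRank, comparable to what LiveJournalPageRank runs.
val ranks = graph.pageRank(tol = 0.0001).vertices
```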
Re: Multi-Line JSON in SparkSQL
I'll try to study that and get back to you. Regards, Olivier.

On Mon, May 4, 2015 at 4:05 AM, Reynold Xin r...@databricks.com wrote:
How does the Pivotal format decide where to split the files? It seems to me the challenge is deciding that, and off the top of my head the only way to do this is to scan from the beginning and parse the JSON properly, which makes it not possible with large files (though it is doable when each input is a whole file and you have a lot of small files). If there is a better way, we should do it.

On Sun, May 3, 2015 at 1:04 PM, Olivier Girardot o.girar...@lateral-thoughts.com wrote:
Hi everyone, Is there any way in Spark SQL to load multi-line JSON data efficiently? I think there was a reference on the mailing list to http://pivotal-field-engineering.github.io/pmr-common/ for its JSONInputFormat, but it's rather inaccessible considering the dependency is not available in any public maven repo (if you know of one, I'd be glad to hear it). Is there any plan to address this, or any public recommendation? (The documentation clearly states that sqlContext.jsonFile will not work for multi-line JSON.) Regards, Olivier.
Blockers for 1.4.0
I'd like to preemptively post the current list of 35 Blockers for release 1.4.0. (There are 53 Critical too, and a total of 273 JIRAs targeted for 1.4.0. Clearly most of that isn't accurate, so it would be good to un-target most of them.) As a matter of process and hygiene, it would be best to either decide they're not Blockers at this point and reprioritize, or focus on addressing them, as we're now in the run-up to the release. I suggest that we shouldn't release with any Blockers outstanding, by definition.

SPARK-7298 [Web UI] Harmonize style of new UI visualizations (Patrick Wendell)
SPARK-7297 [Web UI] Make timeline more discoverable (Patrick Wendell)
SPARK-7284 [Documentation, Streaming] Update streaming documentation for Spark 1.4.0 release (Tathagata Das)
SPARK-7228 [SparkR] SparkR public API for 1.4 release (Shivaram Venkataraman)
SPARK-7158 [SQL] collect and take return different results
SPARK-7139 [Streaming] Allow received block metadata to be saved to WAL and recovered on driver failure (Tathagata Das)
SPARK-7111 [Streaming] Exposing of input data rates of non-receiver streams like Kafka Direct stream (Saisai Shao)
SPARK-6941 [SQL] Provide a better error message to explain that tables created from RDDs are immutable
SPARK-6923 [SQL] Spark SQL CLI does not read Data Source schema correctly
SPARK-6906 [SQL] Refactor Connection to Hive Metastore (Michael Armbrust)
SPARK-6831 [Documentation, PySpark, SparkR, SQL] Document how to use external data sources
SPARK-6824 [SparkR] Fill the docs for DataFrame API in SparkR
SPARK-6812 [SparkR] filter() on DataFrame does not work as expected
SPARK-6811 [SparkR] Building binary R packages for SparkR
SPARK-6806 [Documentation, SparkR] SparkR examples in programming guide (Davies Liu)
SPARK-6784 [SQL] Clean up all the inbound/outbound conversions for DateType (Yin Huai)
SPARK-6702 [Streaming, Web UI] Update the Streaming Tab in Spark UI to show more batch information (Tathagata Das)
SPARK-6654 [Streaming] Update Kinesis Streaming impls (both KCL-based and Direct) to use latest aws-java-sdk and kinesis-client-library
SPARK-5960 [Streaming] Allow AWS credentials to be passed to KinesisUtils.createStream() (Chris Fregly)
SPARK-5948 [SQL] Support writing to partitioned table for the Parquet data source
SPARK-5947 [SQL] First class partitioning support in data sources API
SPARK-5920 [Shuffle] Use a BufferedInputStream to read local shuffle data (Kay Ousterhout)
SPARK-5707 [SQL] Enabling spark.sql.codegen throws ClassNotFound exception
SPARK-5517 [SQL] Add input types for Java UDFs
SPARK-5463 [SQL] Fix Parquet filter push-down
SPARK-5456 [SQL] Decimal Type comparison issue
SPARK-5182 [SQL] Partitioning support for tables created by the data source API (Cheng Lian)
SPARK-5180 [SQL] Data source API improvement
SPARK-4867 [SQL] UDF clean up
SPARK-2973 [SQL] Use LocalRelation for all ExecutedCommands, avoid job for take/collect() (Cheng Lian)
SPARK-2883 [Input/Output, SQL] Spark Support for ORCFile format
SPARK-2873 [SQL] Support disk spilling in Spark SQL aggregation (Yin Huai)
SPARK-1517 [Build, Project Infra] Publish nightly snapshots of documentation, maven artifacts and binary builds (Nicholas Chammas)
SPARK-1442 [SQL] Add Window function support
Re: [discuss] ending support for Java 6?
that bug predates my time at the amplab... :) anyways, just to restate: jenkins currently only builds w/java 7. if you folks need 6, i can make it happen, but it will be a (smallish) bit of work. shane

On Sun, May 3, 2015 at 2:14 AM, Sean Owen so...@cloudera.com wrote:
Should be, but isn't what Jenkins does. https://issues.apache.org/jira/browse/SPARK-1437 At this point it might be simpler to just decide that 1.5 will require Java 7, and then the Jenkins setup is correct. (NB: you can also solve this by setting the bootclasspath to the JDK 6 libs even when using javac 7+, but I think this is overly complicated.)

On Sun, May 3, 2015 at 5:52 AM, Mridul Muralidharan mri...@gmail.com wrote:
Hi Shane, Since we are still maintaining support for JDK 6, Jenkins should be using JDK 6 [1] to ensure we do not inadvertently use a JDK 7 or higher API, which breaks source-level compatibility. -source and -target are insufficient to ensure API usage conforms to the minimum JDK version we are supporting. Regards, Mridul [1] Not JDK 7 as you mentioned

On Sat, May 2, 2015 at 8:53 PM, shane knapp skn...@berkeley.edu wrote:
that's kinda what we're doing right now, java 7 is the default/standard on our jenkins. or, i vote we buy a butler's outfit for thomas and have a second jenkins instance... ;)