When will Spark SQL support building DB index natively?
Hi, the Spark SQL documentation says: "Some of these (such as indexes) are less important due to Spark SQL's in-memory computational model. Others are slotted for future releases of Spark SQL." Among the features listed is "Block level bitmap indexes and virtual columns (used to build indexes)".

For our use cases, a database index is quite important. We have about 300 GB of data, and we always use customer id as the predicate for lookups. Without an index, every lookup scans the full 300 GB and takes about a minute, while MySQL answers the same query in 10 seconds. We tried creating an independent table for each customer id; the results are pretty good, but the logic becomes very complex.

I'm wondering when Spark SQL will support indexes natively, and until then, whether there is an alternative way to get index-like behavior? Thanks
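Until native index support lands, one index-like workaround at the RDD level is to pre-partition the data on customer id, so that a lookup only touches the single partition that can contain the key. A minimal sketch, assuming CSV-like rows whose first field is a numeric customer id (the path, field layout, and partition count below are hypothetical):

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits (needed on Spark 1.2)
import org.apache.spark.rdd.RDD

val sc = new SparkContext(new SparkConf().setAppName("indexed-lookup"))

// Hypothetical layout: text rows whose first comma-separated field is the customer id.
val records: RDD[(Long, String)] = sc.textFile("hdfs:///data/customers")
  .map(line => (line.split(',')(0).toLong, line))

// Hash-partition on the key and cache. This acts as a coarse index:
// lookup() sees the partitioner and runs a job on only the one partition
// that can hold the key, instead of scanning all 300 GB.
val byCustomer = records.partitionBy(new HashPartitioner(512)).cache()

val rows: Seq[String] = byCustomer.lookup(42L) // touches a single partition

The first lookup still pays the cost of reading and shuffling the dataset to build the cached, partitioned RDD; lookups after that are cheap, which fits the "many lookups against the same 300 GB" access pattern described above.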
Re: RDD data flow
Patrick Wendell wrote:
> The Partition itself doesn't need to be an iterator - the iterator comes from the result of compute(partition). The Partition is just an identifier for that partition, not the data itself.

OK, that makes sense. The docs for Partition are a bit vague on this point; maybe I'll add this to the docs. Thanks Patrick!

- Madhu
https://www.linkedin.com/in/msiddalingaiah
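To illustrate that contract, here is a minimal custom-RDD sketch (class and field names are made up for illustration): the Partition subclass carries only identifying metadata, and the per-partition data exists only as the iterator returned by compute().

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// A Partition is just a serializable identifier plus whatever metadata
// compute() needs to locate its slice of the data -- no data, no iterator.
class RangePartition(val index: Int, val start: Int, val end: Int) extends Partition

class RangeRDD(sc: SparkContext, n: Int, slices: Int) extends RDD[Int](sc, Nil) {

  // Runs on the driver: describes WHICH partitions exist.
  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](slices) { i =>
      val step = n / slices
      new RangePartition(i, i * step, if (i == slices - 1) n else (i + 1) * step)
    }

  // Runs on executors: the data flow starts with the iterator returned here.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    (p.start until p.end).iterator
  }
}

// e.g. new RangeRDD(sc, 1000, 4).count() == 1000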
Re: running the Terasort example
On 12/16/14, 11:42 PM, Ewan Higgs ewan.hi...@ugent.be wrote:
> Hi Tim,
>
> On 16 Dec 2014, at 19:27, Tim Harsch thar...@cray.com wrote:
>> Hi Ewan, Thanks, I think I was just a bit confused at the time, I was looking at the spark-perf repo when there was the problem (uh.. ok)...
>
> The PR that I am working on is indeed for spark-perf.

Yes, but the example usage you gave is for the code in ehiggs/spark (which is where I got myself confused):

? git remote show origin
* remote origin
  Fetch URL: g...@github.com:ehiggs/spark.git
  Push URL: g...@github.com:ehiggs/spark.git
...
? ll bin/run-example
-rwxr-xr-x 1 tharsch 513 2.1K Dec 11 21:02 bin/run-example

run-example is not in spark-perf. What is the expected usage for the code that is in spark-perf? I'm hoping I'll have time to run it later today, so hopefully I will figure it out on my own.

...snip...

>> I can get past this by setting hadoop.version to 2.5.0 in the parent pom.
>
> I wasn't sure how to get this working across all the Hadoop versions, so I made it work with 2.4.0 and above. If you have advice on back-porting this then I'm happy to implement it.

I would like to try; hopefully I can find the time.

> NB, TeraValidate may not be functioning appropriately. If you have trouble with it, I recommend using the Hadoop version.

Thanks for the warning, I bet I could have banged my head on that for hours.

> Yours,
> Ewan

Thanks,
Tim

On 12/16/14, 12:38 AM, Ewan Higgs ewan.hi...@ugent.be wrote:
> Hi Tim,
> run-example is here: https://github.com/ehiggs/spark/blob/terasort/bin/run-example
> It should be in the repository that you cloned. So if you were at the top level of the checkout, run-example would be run as ./bin/run-example.
> Yours,
> Ewan Higgs
>
> On 12/12/14 01:06, Tim Harsch wrote:
>> Hi all,
>> I just joined the list, so I don't have a message history that would allow me to reply to this post: http://apache-spark-developers-list.1001551.n3.nabble.com/Terasort-example-td9284.html
>> I am interested in running the terasort example. I cloned the repo https://github.com/ehiggs/spark and did a checkout of the terasort branch. In the above referenced post Ewan gives the example:
>>
>> # Generate 1M 100 byte records:
>> ./bin/run-example terasort.TeraGen 100M ~/data/terasort_in
>>
>> I don't see a "run-example" in that repo. I'm sure I am missing something basic, or, less likely, maybe some changes weren't pushed?
>> Thanks for any help,
>> Tim
Re: Nabble mailing list mirror errors: This post has NOT been accepted by the mailing list yet
Yeah, it looks like messages that are successfully posted via Nabble end up on the Apache mailing list, but messages posted directly to Apache aren't mirrored to Nabble anymore because it's based off the incubator mailing list. We should fix this so that Nabble posts to / archives the non-incubator list.

On Sat, Dec 13, 2014 at 6:27 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote:
> Since you mentioned this, I had a related quandary recently -- it also says that the forum archives u...@spark.incubator.apache.org and d...@spark.incubator.apache.org respectively, yet the Community page clearly says to email the @spark.apache.org lists (but the Nabble archive is linked right there too). IMO even putting a clear explanation at the top -- "Posting here requires that you create an account via the UI. Your message will be sent to both spark.incubator.apache.org and spark.apache.org" (if that is the case; I'm not sure which alias Nabble posts get sent to) -- would make things a lot clearer.
>
> On Sat, Dec 13, 2014 at 5:05 PM, Josh Rosen rosenvi...@gmail.com wrote:
>> I've noticed that several users are attempting to post messages to Spark's user / dev mailing lists using the Nabble web UI (http://apache-spark-user-list.1001560.n3.nabble.com/). However, there are many posts in Nabble that are not posted to the Apache lists and are flagged with "This post has NOT been accepted by the mailing list yet." errors. I suspect that the issue is that users are not completing the sign-up confirmation process (http://apache-spark-user-list.1001560.n3.nabble.com/mailing_list/MailingListOptions.jtp?forum=1), which is preventing their emails from being accepted by the mailing list. I wanted to mention this issue to the Spark community to see whether there are any good solutions to address this. I have spoken to users who think that our mailing list is unresponsive / inactive because their un-posted messages haven't received any replies.
>>
>> - Josh
Fwd: [VOTE] Release Apache Spark 1.2.0 (RC2)
Forgot Reply To All ;o(

-- Forwarded message --
From: Krishna Sankar ksanka...@gmail.com
Date: Wed, Dec 10, 2014 at 9:16 PM
Subject: Re: [VOTE] Release Apache Spark 1.2.0 (RC2)
To: Matei Zaharia matei.zaha...@gmail.com

+1. Works the same as RC1.

1. Compiled on OSX 10.10 (Yosemite): mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package (13:07 min)
2. Tested pyspark and MLlib - running, and compared results with 1.1.x
2.1. statistics: OK
2.2. Linear/Ridge/Lasso Regression: OK. Slight difference in the print method of the model object (vs. 1.1.x) - it now has a label and more details. This is good.
2.3. Decision Tree, Naive Bayes: OK. Changes in print(model) - now print(model.toDebugString()) - OK. Some changes in Naive Bayes, different from my 1.1.x code - had to flatten list structures, and zip requires the same number of elements in each partition. After code changes it ran fine.
2.4. KMeans: OK. Center and scale: OK. zip occasionally fails with the error (localhost): org.apache.spark.SparkException: Can only zip RDDs with same number of elements in each partition. Has https://issues.apache.org/jira/browse/SPARK-2251 reappeared? Made it work by doing a different transformation, i.e. reusing an original RDD. (Xiangrui, I will send you the iPython Notebook and the dataset by a separate e-mail.)
2.5. RDD operations: OK. State of the Union texts - MapReduce, filter, sortByKey (word count).
2.6. recommendation: OK
2.7. Good work! In 1.x.x, I had a map/distinct over the movielens medium dataset which never worked. Works fine in 1.2.0!
3. Scala MLlib - a subset of the examples in #2 above, in Scala:
3.1. statistics: OK
3.2. Linear Regression: OK
3.3. Decision Tree: OK
3.4. KMeans: OK

Cheers
k/

On Wed, Dec 10, 2014 at 3:05 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
> +1. Tested on Mac OS X.
> Matei
>
> On Dec 10, 2014, at 1:08 PM, Patrick Wendell pwend...@gmail.com wrote:
>> Please vote on releasing the following candidate as Apache Spark version 1.2.0!
>>
>> The tag to be voted on is v1.2.0-rc2 (commit a428c446e2):
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=a428c446e23e628b746e0626cc02b7b3cadf588e
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-1.2.0-rc2/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1055/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-1.2.0-rc2-docs/
>>
>> Please vote on releasing this package as Apache Spark 1.2.0! The vote is open until Saturday, December 13, at 21:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.2.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> == What justifies a -1 vote for this release? ==
>> This vote is happening relatively late into the QA period, so -1 votes should only occur for significant regressions from 1.0.2. Bugs already present in 1.1.X, minor regressions, or bugs related to new features will not block this release.
>>
>> == What default changes should I be aware of? ==
>> 1. The default value of spark.shuffle.blockTransferService has been changed to netty -- old behavior can be restored by switching to nio.
>> 2. The default value of spark.shuffle.manager has been changed to sort -- old behavior can be restored by setting spark.shuffle.manager to hash.
>> == How does this differ from RC1? ==
>> This has fixes for a handful of identified issues - some of the notable fixes are:
>>
>> [Core]
>> SPARK-4498: Standalone Master can fail to recognize completed/failed applications
>>
>> [SQL]
>> SPARK-4552: Query for empty parquet table in spark sql hive get IllegalArgumentException
>> SPARK-4753: Parquet2 does not prune based on OR filters on partition columns
>> SPARK-4761: With JDBC server, set Kryo as default serializer and disable reference tracking
>> SPARK-4785: When called with arguments referring column fields, PMOD throws NPE
>>
>> - Patrick
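Aside on the zip() failure in Krishna's test notes above - a sketch of the pitfall and of the "reuse an original RDD" workaround (the function and its arguments are hypothetical):

import org.apache.spark.rdd.RDD

// zip() requires both RDDs to have the same number of partitions AND the
// same number of elements in each partition; otherwise it throws
// "SparkException: Can only zip RDDs with same number of elements in each partition".
def centerAndScale(data: RDD[Double], mean: Double, std: Double): RDD[(Double, Double)] = {
  // Risky: pairing `data` with a second RDD that was rebuilt through any
  // path that repartitions or drops elements breaks the alignment:
  //   data.zip(scaledBuiltElsewhere)

  // Safer: derive both values from the SAME parent in one map(), so the
  // per-partition alignment holds by construction.
  data.map(x => (x, (x - mean) / std))
}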
Re: Spark Shell slowness on Google Cloud
Here's another data point: the slow part of my code is the construction of an RDD as the union of the textFile RDDs representing data from several distinct Google Storage directories. So the question becomes: what computation happens when calling the union method on two RDDs?

On Wed, Dec 17, 2014 at 11:24 PM, Alessandro Baretta alexbare...@gmail.com wrote:
> Well, what do you suggest I run to test this? But more importantly, what information would this give me?
>
> On Wed, Dec 17, 2014 at 10:46 PM, Denny Lee denny.g@gmail.com wrote:
>> Oh, it makes sense if gsutil scans through this quickly, but I was wondering if running a Hadoop job / bdutil would result in just as fast scans?
>>
>> On Wed Dec 17 2014 at 10:44:45 PM Alessandro Baretta alexbare...@gmail.com wrote:
>>> Denny,
>>> No, gsutil scans through the listing of the bucket quickly. See the following:
>>>
>>> alex@hadoop-m:~/split$ time bash -c "gsutil ls gs://my-bucket/20141205/csv/*/*/* | wc -l"
>>> 6860
>>> real    0m6.971s
>>> user    0m1.052s
>>> sys     0m0.096s
>>>
>>> Alex
>>>
>>> On Wed, Dec 17, 2014 at 10:29 PM, Denny Lee denny.g@gmail.com wrote:
>>>> I'm curious if you're seeing the same thing when using bdutil against GCS? I'm wondering if this may be an issue concerning the transfer rate of Spark -> Hadoop -> GCS Connector -> GCS.
>>>>
>>>> On Wed Dec 17 2014 at 10:09:17 PM Alessandro Baretta alexbare...@gmail.com wrote:
>>>>> All,
>>>>> I'm using the Spark shell to interact with a small test deployment of Spark, built from the current master branch. I'm processing a dataset comprising a few thousand objects on Google Cloud Storage, split into a half dozen directories. My code constructs an object--let me call it the Dataset object--that defines a distinct RDD for each directory. The constructor of the object only defines the RDDs; it does not actually evaluate them, so I would expect it to return very quickly. Indeed, the logging code in the constructor prints a line signaling the completion of the code almost immediately after invocation, but the Spark shell does not show the prompt right away. Instead, it spends a few minutes seemingly frozen, eventually producing the following output:
>>>>>
>>>>> 14/12/18 05:52:49 INFO mapred.FileInputFormat: Total input paths to process : 9
>>>>> 14/12/18 05:54:15 INFO mapred.FileInputFormat: Total input paths to process : 759
>>>>> 14/12/18 05:54:40 INFO mapred.FileInputFormat: Total input paths to process : 228
>>>>> 14/12/18 06:00:11 INFO mapred.FileInputFormat: Total input paths to process : 3076
>>>>> 14/12/18 06:02:02 INFO mapred.FileInputFormat: Total input paths to process : 1013
>>>>> 14/12/18 06:02:21 INFO mapred.FileInputFormat: Total input paths to process : 156
>>>>>
>>>>> This stage is inexplicably slow. What could be happening? Thanks.
>>>>> Alex
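For what it's worth, a sketch of where the time plausibly goes, runnable in the Spark shell (the bucket paths are hypothetical, and sc is the shell's SparkContext):

// union() is lazy: it only wraps its parents in a UnionRDD and returns
// immediately, so building the combined RDD should not be the slow part.
val dirs = Seq("gs://my-bucket/dir1/*", "gs://my-bucket/dir2/*") // hypothetical
val combined = dirs.map(sc.textFile(_)).reduce(_ union _)

// The cost shows up when partition metadata is first needed: each
// underlying HadoopRDD then calls FileInputFormat.getSplits, which lists
// every input path -- producing the "Total input paths to process : N"
// log lines above -- and each listing is a remote round trip to GCS.
val numPartitions = combined.partitions.length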