When will Spark SQL support building DB index natively?

2014-12-17 Thread Xuelin Cao

Hi, 
     The Spark SQL documentation says: "Some of these (such as indexes) are
less important due to Spark SQL's in-memory computational model. Others are
slotted for future releases of Spark SQL."
   - Block level bitmap indexes and virtual columns (used to build indexes)

     For our use cases, a DB index is quite important. We have about 300 GB of
data, and we always use customer id as the predicate for lookups. Without an
index we have to scan all 300 GB, and a simple lookup takes about 1 minute,
while MySQL takes only 10 seconds. We tried creating an independent table for
each customer id; the results are pretty good, but the logic becomes very
complex.
     I'm wondering when Spark SQL will support DB indexes and, until then,
whether there is an alternative way to get index-like lookups?
Thanks
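
Until native indexing lands, here is a rough Scala sketch of one index-free
workaround (the path, field layout, and customer id below are hypothetical):
key the data by customer id and hash-partition it once; lookup() on a pair RDD
with a known partitioner only computes the partition the key hashes to, so a
point lookup no longer scans the full 300 GB.

import org.apache.spark.SparkContext._      // pair-RDD implicits (Spark 1.x era)
import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

// Assumes the first tab-separated field is the customer id (hypothetical layout).
val byCustomer = sc.textFile("hdfs:///data/records")
  .map(line => (line.split('\t')(0), line))
  .partitionBy(new HashPartitioner(512))    // one-time shuffle
  .persist(StorageLevel.MEMORY_AND_DISK)

// An equality lookup only computes the single partition the key maps to.
val rows = byCustomer.lookup("customer_12345")

This is not a real index (range predicates would still scan), but for equality
lookups on customer id it effectively prunes the scan to one partition.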


Re: RDD data flow

2014-12-17 Thread Madhu
Patrick Wendell wrote
 The Partition itself doesn't need to be an iterator - the iterator
 comes from the result of compute(partition). The Partition is just an
 identifier for that partition, not the data itself.

OK, that makes sense. The docs for Partition are a bit vague on this point.
Maybe I'll add this to the docs.

Thanks Patrick!
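
For anyone else reading the archive, a minimal toy sketch of that split of
responsibilities (not Spark's internal code): the Partition only carries an
index identifying a slice, and compute() is what actually produces the
Iterator over that slice's data.

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// The Partition is just an identifier for a slice, not the data itself.
class RangePartition(val index: Int, val start: Int, val end: Int) extends Partition

// The data only exists as the Iterator returned by compute(partition).
class RangeRDD(sc: SparkContext, n: Int, slices: Int) extends RDD[Int](sc, Nil) {
  override protected def getPartitions: Array[Partition] =
    (0 until slices).map { i =>
      new RangePartition(i, i * n / slices, (i + 1) * n / slices): Partition
    }.toArray

  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    (p.start until p.end).iterator
  }
}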






Re: running the Terasort example

2014-12-17 Thread Tim Harsch

On 12/16/14, 11:42 PM, Ewan Higgs ewan.hi...@ugent.be wrote:

Hi Tim,

 On 16 Dec 2014, at 19:27, Tim Harsch thar...@cray.com wrote:
 
 Hi Ewan,
 Thanks, I think I was just a bit confused at the time, I was looking at
 the spark-perf repo when there was the problem (uh.. ok)…
 
The PR that I am working on is indeed for spark-perf.
Yes, but the example usage you gave is for the code in ehiggs/spark (which
is where I got myself confused).

$ git remote show origin
* remote origin
  Fetch URL: g...@github.com:ehiggs/spark.git
  Push  URL: g...@github.com:ehiggs/spark.git
…

$ ll bin/run-example
-rwxr-xr-x  1 tharsch  513   2.1K Dec 11 21:02 bin/run-example


run-example is not in spark-perf. What is the expected usage for the code
that is in spark-perf?  I'm hoping I'll have time to run it later today,
so hopefully I will figure it out on my own.



 

 …snip...
 
 
 I can get past this by setting hadoop.version to 2.5.0 in the parent
pom.
 
I wasn’t sure how to get this working across all the Hadoop versions so I
made it work with 2.4.0 and above. If you have advice on back porting
this then I’m happy to implement it.

I would like to try, hopefully I can find the time.


NB, TeraValidate may not be functioning appropriately. If you have
trouble with it, I recommend using the Hadoop version.

Thanks for the warning, I bet I could have banged my head on that for
hours.


Yours,
Ewan

 Thanks,
 Tim
 
 
 On 12/16/14, 12:38 AM, Ewan Higgs ewan.hi...@ugent.be wrote:
 
 Hi Tim,
 run-example is here:
 https://github.com/ehiggs/spark/blob/terasort/bin/run-example
 
 It should be in the repository that you cloned. So if you were at the
 top level of the checkout, run-example would be run as
./bin/run-example.
 
 Yours,
 Ewan Higgs
 
 On 12/12/14 01:06, Tim Harsch wrote:
 Hi all,
 I just joined the list, so I don't have a message history that would
 allow
 me to reply to this post:
 
 
 http://apache-spark-developers-list.1001551.n3.nabble.com/Terasort-example-td9284.html
 
 I am interested in running the terasort example.  I cloned the repo
 https://github.com/ehiggs/spark and checked out the terasort branch.
 In the above referenced post Ewan gives the example
 
 # Generate 1M 100 byte records:
   ./bin/run-example terasort.TeraGen 100M ~/data/terasort_in
 
 
 I don't see a "run-example" in that repo.  I'm sure I am missing
 something basic, or, less likely, maybe some changes weren't pushed?
 
 Thanks for any help,
 Tim
 
 
 
 
 




Re: Nabble mailing list mirror errors: This post has NOT been accepted by the mailing list yet

2014-12-17 Thread Josh Rosen
Yeah, it looks like messages that are successfully posted via Nabble end up
on the Apache mailing list, but messages posted directly to Apache aren't
mirrored to Nabble anymore because it's based off the incubator mailing
list.  We should fix this so that Nabble posts to / archives the
non-incubator list.

On Sat, Dec 13, 2014 at 6:27 PM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:

 Since you mentioned this, I had a related quandary recently -- the forum also
 says it archives u...@spark.incubator.apache.org and
 d...@spark.incubator.apache.org respectively, yet the Community page clearly
 says to email the @spark.apache.org lists (but the Nabble archive is linked
 right there too). IMO even putting a clear explanation at the top -- "Posting
 here requires that you create an account via the UI. Your message will be
 sent to both spark.incubator.apache.org and spark.apache.org" (if that is the
 case; I'm not sure which alias Nabble posts get sent to) -- would make things
 a lot more clear.

 On Sat, Dec 13, 2014 at 5:05 PM, Josh Rosen rosenvi...@gmail.com wrote:

 I've noticed that several users are attempting to post messages to
 Spark's user / dev mailing lists using the Nabble web UI (
 http://apache-spark-user-list.1001560.n3.nabble.com/).  However, there
 are many posts in Nabble that are not posted to the Apache lists and are
 flagged with "This post has NOT been accepted by the mailing list yet"
 errors.

 I suspect that the issue is that users are not completing the sign-up
 confirmation process (
 http://apache-spark-user-list.1001560.n3.nabble.com/mailing_list/MailingListOptions.jtp?forum=1),
 which is preventing their emails from being accepted by the mailing list.

 I wanted to mention this issue to the Spark community to see whether
 there are any good solutions to address this.  I have spoken to users who
 think that our mailing list is unresponsive / inactive because their
 un-posted messages haven't received any replies.

 - Josh




Fwd: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-17 Thread Krishna Sankar
Forgot Reply To All ;o(
-- Forwarded message --
From: Krishna Sankar ksanka...@gmail.com
Date: Wed, Dec 10, 2014 at 9:16 PM
Subject: Re: [VOTE] Release Apache Spark 1.2.0 (RC2)
To: Matei Zaharia matei.zaha...@gmail.com

+1
Works same as RC1
1. Compiled OSX 10.10 (Yosemite) mvn -Pyarn -Phadoop-2.4
-Dhadoop.version=2.4.0 -DskipTests clean package 13:07 min
2. Tested pyspark, mllib - running as well as comparing results with 1.1.x
2.1. statistics OK
2.2. Linear/Ridge/Lasso Regression OK
   Slight difference in the print method (vs. 1.1.x) of the model
object - with a label and more details. This is good.
2.3. Decision Tree, Naive Bayes OK
   Changes in print(model) - now print(model.toDebugString()) - OK
   Some changes in NaiveBayes, different from my 1.1.x code - had to
flatten list structures; zip required the same number of elements per partition.
   After the code changes it ran fine.
2.4. KMeans OK
   Center And Scale OK
   zip occasionally fails with error (localhost):
org.apache.spark.SparkException: Can only zip RDDs with same number of
elements in each partition
Has https://issues.apache.org/jira/browse/SPARK-2251 reappeared?
Made it work by using a different transformation, i.e. reusing an original
RDD (see the sketch after this message).
(Xiangrui, I will send you the iPython Notebook and the dataset in a separate
e-mail)
2.5. rdd operations OK
   State of the Union Texts - MapReduce, Filter, sortByKey (word count)
2.6. recommendation OK
2.7. Good work! In 1.x.x, I had a map/distinct over the MovieLens medium
dataset which never worked. Works fine in 1.2.0!
3. Scala MLlib - subset of the examples in #2 above, with Scala
3.1. statistics OK
3.2. Linear Regression OK
3.3. Decision Tree OK
3.4. KMeans OK
Cheers
k/
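
To illustrate the zip failure and workaround mentioned in 2.4, here is a
minimal Scala sketch (the data and pipeline are hypothetical; the original
tests were in PySpark). zip requires both RDDs to have the same number of
partitions and the same number of elements per partition, which is easiest to
guarantee by deriving both sides from the same parent RDD.

// Hypothetical illustration of the zip constraint.
val raw      = sc.parallelize(1 to 1000, 4)       // toy data, 4 partitions
val features = raw.map(_.toDouble)
val scaled   = features.map(_ / 1000.0)

// Safe: both sides derive from the same parent, so partition sizes match.
val paired = features.zip(scaled)

// Risky: an independently built RDD may have different partition sizes, and
// zip then fails with "Can only zip RDDs with same number of elements in
// each partition".
val other = sc.parallelize(1 to 1000).distinct()
// val broken = features.zip(other.map(_.toDouble))   // may throw SparkException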

On Wed, Dec 10, 2014 at 3:05 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 +1

 Tested on Mac OS X.

 Matei

  On Dec 10, 2014, at 1:08 PM, Patrick Wendell pwend...@gmail.com wrote:
 
  Please vote on releasing the following candidate as Apache Spark version
 1.2.0!
 
  The tag to be voted on is v1.2.0-rc2 (commit a428c446e2):
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=a428c446e23e628b746e0626cc02b7b3cadf588e
 
  The release files, including signatures, digests, etc. can be found at:
  http://people.apache.org/~pwendell/spark-1.2.0-rc2/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/pwendell.asc
 
  The staging repository for this release can be found at:
  https://repository.apache.org/content/repositories/orgapachespark-1055/
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~pwendell/spark-1.2.0-rc2-docs/
 
  Please vote on releasing this package as Apache Spark 1.2.0!
 
  The vote is open until Saturday, December 13, at 21:00 UTC and passes
  if a majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Spark 1.2.0
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 
  == What justifies a -1 vote for this release? ==
  This vote is happening relatively late into the QA period, so
  -1 votes should only occur for significant regressions from
  1.0.2. Bugs already present in 1.1.X, minor
  regressions, or bugs related to new features will not block this
  release.
 
  == What default changes should I be aware of? ==
  1. The default value of spark.shuffle.blockTransferService has been
  changed to netty
  -- Old behavior can be restored by switching to nio
 
  2. The default value of spark.shuffle.manager has been changed to
 sort.
  -- Old behavior can be restored by setting spark.shuffle.manager to
 hash.
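
  A minimal sketch of restoring the old defaults in application code (only
  the two configuration keys come from the notes above; setting them in
  spark-defaults.conf or via --conf works the same way):

  import org.apache.spark.{SparkConf, SparkContext}

  // Restore the pre-1.2.0 shuffle defaults described above.
  val conf = new SparkConf()
    .setAppName("legacy-shuffle-settings")            // hypothetical app name
    .set("spark.shuffle.blockTransferService", "nio")
    .set("spark.shuffle.manager", "hash")
  val sc = new SparkContext(conf)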
 
  == How does this differ from RC1 ==
  This has fixes for a handful of issues identified - some of the
  notable fixes are:
 
  [Core]
  SPARK-4498: Standalone Master can fail to recognize completed/failed
  applications
 
  [SQL]
  SPARK-4552: Query for empty parquet table in spark sql hive get
  IllegalArgumentException
  SPARK-4753: Parquet2 does not prune based on OR filters on partition
 columns
  SPARK-4761: With JDBC server, set Kryo as default serializer and
  disable reference tracking
  SPARK-4785: When called with arguments referring column fields, PMOD
 throws NPE
 
  - Patrick
 
 






Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Alessandro Baretta
Here's another data point: the slow part of my code is the construction of
an RDD as the union of the textFile RDDs representing data from several
distinct Google Cloud Storage directories. So the question becomes: what
computation happens when calling the union method on two RDDs?
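
For context, a minimal sketch of the pattern in question (the directory names
are hypothetical): union itself is lazy and only wraps its inputs in a
UnionRDD, so the input-path listing seen later in this thread's logs is
typically triggered when the partitions are actually computed, for example at
the first action.

// Minimal sketch; the paths are hypothetical.
val dirs = Seq(
  "gs://my-bucket/20141205/csv/a",
  "gs://my-bucket/20141205/csv/b",
  "gs://my-bucket/20141205/csv/c"
)
val perDir = dirs.map(d => sc.textFile(d))   // lazily defined, nothing read yet
val all    = perDir.reduce(_ union _)        // builds a UnionRDD, still lazy

// The "Total input paths to process" lines from FileInputFormat appear once
// partitions are materialized, e.g. at the first action:
println(all.count())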

On Wed, Dec 17, 2014 at 11:24 PM, Alessandro Baretta alexbare...@gmail.com
wrote:

 Well, what do you suggest I run to test this? But more importantly, what
 information would this give me?

 On Wed, Dec 17, 2014 at 10:46 PM, Denny Lee denny.g@gmail.com wrote:

 Oh, it makes sense that gsutil scans through this quickly, but I was
 wondering if running a Hadoop job / bdutil would result in just as fast
 scans?


 On Wed Dec 17 2014 at 10:44:45 PM Alessandro Baretta 
 alexbare...@gmail.com wrote:

 Denny,

 No, gsutil scans through the listing of the bucket quickly. See the
 following.

 alex@hadoop-m:~/split$ time bash -c "gsutil ls gs://my-bucket/20141205/csv/*/*/* | wc -l"

 6860

 real0m6.971s
 user0m1.052s
 sys 0m0.096s

 Alex


 On Wed, Dec 17, 2014 at 10:29 PM, Denny Lee denny.g@gmail.com
 wrote:

 I'm curious if you're seeing the same thing when using bdutil against
 GCS?  I'm wondering if this may be an issue concerning the transfer rate of
 Spark -> Hadoop -> GCS Connector -> GCS.


 On Wed Dec 17 2014 at 10:09:17 PM Alessandro Baretta 
 alexbare...@gmail.com wrote:

 All,

 I'm using the Spark shell to interact with a small test deployment of
 Spark, built from the current master branch. I'm processing a dataset
 comprising a few thousand objects on Google Cloud Storage, split into a
 half dozen directories. My code constructs an object--let me call it the
 Dataset object--that defines a distinct RDD for each directory. The
 constructor of the object only defines the RDDs; it does not actually
 evaluate them, so I would expect it to return very quickly. Indeed, the
 logging code in the constructor prints a line signaling the completion of
 the code almost immediately after invocation, but the Spark shell does not
 show the prompt right away. Instead, it spends a few minutes seemingly
 frozen, eventually producing the following output:

 14/12/18 05:52:49 INFO mapred.FileInputFormat: Total input paths to
 process : 9

 14/12/18 05:54:15 INFO mapred.FileInputFormat: Total input paths to
 process : 759

 14/12/18 05:54:40 INFO mapred.FileInputFormat: Total input paths to
 process : 228

 14/12/18 06:00:11 INFO mapred.FileInputFormat: Total input paths to
 process : 3076

 14/12/18 06:02:02 INFO mapred.FileInputFormat: Total input paths to
 process : 1013

 14/12/18 06:02:21 INFO mapred.FileInputFormat: Total input paths to
 process : 156

 This stage is inexplicably slow. What could be happening?

 Thanks.


 Alex