Re: Storing an action result in HDFS

2015-06-22 Thread Chris Gore
Hi Ravi, Welcome, you probably want RDD.saveAsTextFile("hdfs:///my_file") Chris On Jun 22, 2015, at 5:28 PM, ravi tella ddpis...@gmail.com wrote: Hello All, I am new to Spark. I have a very basic question. How do I write the output of an action on an RDD to HDFS? Thanks in advance

Re: Storing an action result in HDFS

2015-06-22 Thread Chris Gore
for the quick reply and the welcome. I am trying to read a file from HDFS and then write just the first line back to HDFS. I am calling first() on the RDD to get the first line. Sent from my iPhone On Jun 22, 2015, at 7:42 PM, Chris Gore cdg...@cdgore.com wrote: Hi Ravi, Welcome, you
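A minimal sketch of the case described above, assuming a live SparkContext: first() is an action that returns a plain String to the driver rather than an RDD, so the value has to be wrapped in an RDD again before saveAsTextFile can be called on it. The object name, method name, and paths are hypothetical and not taken from the thread.

```scala
import org.apache.spark.SparkContext

object SaveFirstLine {
  /** Reads inputPath, takes its first line, and writes that line to outputPath. */
  def saveFirstLine(sc: SparkContext, inputPath: String, outputPath: String): Unit = {
    // first() is an action: the result is a String on the driver, not an RDD.
    val firstLine: String = sc.textFile(inputPath).first()

    // Wrap the value in a one-element, one-partition RDD so it can be saved.
    // saveAsTextFile creates a directory at outputPath containing a part file.
    sc.parallelize(Seq(firstLine), numSlices = 1).saveAsTextFile(outputPath)
  }
}
```

With an existing context this could be called as SaveFirstLine.saveFirstLine(sc, "hdfs:///my_file", "hdfs:///my_first_line"), where both paths are placeholders. The output path must not already exist, since saveAsTextFile refuses to overwrite.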

Re: Compare LogisticRegression results using Mllib with those using other libraries (e.g. statsmodel)

2015-05-20 Thread Chris Gore
I tried running this data set as described with my own implementation of L2-regularized logistic regression using L-BFGS to compare: https://github.com/cdgore/fitbox Intercept: -0.886745823033 Weights (['gre', 'gpa', 'rank']):[ 0.28862268 0.19402388 -0.36637964]

Re: Can Spark benefit from Hive-like partitions?

2015-01-26 Thread Chris Gore
Good to hear there will be partitioning support. I’ve had some success loading partitioned data specified with Unix globbing format. i.e.: sc.textFile("s3://bucket/directory/dt=2014-11-{2[4-9],30}T00-00-00") would load dates 2014-11-24 through 2014-11-30. Not the most ideal solution, but it
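To see which partition directories a pattern like this selects before handing it to sc.textFile, the pattern can be checked locally. The sketch below uses the JDK's java.nio glob matcher as a stand-in; this is an assumption, since Hadoop's own glob implementation follows similar but not identical rules. The object name and the candidate directory names are hypothetical.

```scala
import java.nio.file.{FileSystems, Paths}

object DateGlob {
  // Final path component of the pattern from the message above.
  val pattern = "dt=2014-11-{2[4-9],30}T00-00-00"

  // JDK glob matcher; used only as an approximation of Hadoop's glob syntax.
  private val matcher = FileSystems.getDefault.getPathMatcher("glob:" + pattern)

  /** True if a directory name such as "dt=2014-11-24T00-00-00" matches the pattern. */
  def matches(dirName: String): Boolean = matcher.matches(Paths.get(dirName))

  def main(args: Array[String]): Unit = {
    // Candidate partition names for 2014-11-20 through 2014-11-30.
    val candidates = (20 to 30).map(d => s"dt=2014-11-${d}T00-00-00")
    candidates.filter(matches).foreach(println)
  }
}
```

The alternation {2[4-9],30} covers the two-digit days 24 to 29 through the character class and day 30 through the literal branch, which is how a single string spans a date range.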

Re: MLLib sparse vector

2014-09-15 Thread Chris Gore
Hi Sameer, MLLib uses Breeze’s vector format under the hood. You can use that. http://www.scalanlp.org/api/breeze/index.html#breeze.linalg.SparseVector For example: import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV} val numClasses = classes.distinct.count.toInt val

Re: MLLib sparse vector

2014-09-15 Thread Chris Gore
`Vectors.sparse`: val sv = Vectors.sparse(numProducts, productIds.map(x => (x, 1.0))) where numProducts should be the largest product id plus one. Best, Xiangrui On Mon, Sep 15, 2014 at 12:46 PM, Chris Gore cdg...@cdgore.com wrote: Hi Sameer, MLLib uses Breeze’s vector format under the hood
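A small sketch of the Vectors.sparse call quoted above, assuming spark-mllib is on the classpath. The object name, method name, and sample ids are hypothetical; numProducts is passed in because it describes the whole data set (largest product id plus one), not a single record.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

object ProductIndicator {
  /**
   * Builds a sparse vector of length numProducts with 1.0 at each product id.
   * Duplicate ids are dropped first so that each index appears only once.
   */
  def indicator(numProducts: Int, productIds: Seq[Int]): Vector = {
    require(productIds.forall(id => id >= 0 && id < numProducts),
      "every product id must lie in [0, numProducts)")
    Vectors.sparse(numProducts, productIds.distinct.map(x => (x, 1.0)))
  }
}
```

For example, ProductIndicator.indicator(8, Seq(7, 0, 3)) describes a length-8 vector whose only non-zero entries sit at indices 0, 3, and 7.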

Re: Accessing neighboring elements in an RDD

2014-09-03 Thread Chris Gore
There is support for Spark in ElasticSearch’s Hadoop integration package. http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/spark.html Maybe you could split and insert all of your documents from Spark and then query for “MoreLikeThis” on the ElasticSearch index. I haven’t

Re: Error: No space left on device

2014-07-16 Thread Chris Gore
Hi Chris, I've encountered this error when running Spark’s ALS methods too. In my case, it was because I set spark.local.dir improperly, and every time there was a shuffle, it would spill many GB of data onto the local drive. What fixed it was setting it to use the /mnt directory, where a
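One way to point Spark's scratch space at a larger volume is to set spark.local.dir on the SparkConf before the context is created. This is a sketch under assumptions: "/mnt/spark" is a hypothetical directory under /mnt, and the object and application names are illustrative. On a cluster, the manager's own setting (SPARK_LOCAL_DIRS in standalone mode, LOCAL_DIRS on YARN) takes precedence over this property.

```scala
import org.apache.spark.SparkConf

object LocalDirConf {
  /**
   * Returns a SparkConf whose shuffle and spill files go to the given directory.
   * The default "/mnt/spark" is a hypothetical path, not one from the thread.
   */
  def withLocalDir(dir: String = "/mnt/spark"): SparkConf =
    new SparkConf()
      .setAppName("als-job") // illustrative application name
      .set("spark.local.dir", dir)
}
```

The same property can also be supplied at submit time with --conf spark.local.dir=... instead of being hard-coded.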

Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Chris Gore
We'd love to see a Spark user group in Los Angeles and connect with others working with it here. Ping me if you're in the LA area and use Spark at your company ( ch...@retentionscience.com ). Chris Retention Science call: 734.272.3099 On Mar