Re: Has anyone installed the scala kernel for Jupyter notebook

2016-09-22 Thread andy petrella
heya, I'd say if you wanna go the Spark and Scala way, do yourself a favour
and go for the Spark Notebook
(check http://spark-notebook.io/ for a pre-built distro or build your own)
hth

On Thu, Sep 22, 2016 at 12:45 AM Arif,Mubaraka 
wrote:

> we installed it but the kernel dies.
> Any clue, why ?
>
> thanks for the link :)-
>
> ~muby
>
> 
> From: Jakob Odersky [ja...@odersky.com]
> Sent: Wednesday, September 21, 2016 4:54 PM
> To: Arif,Mubaraka
> Cc: User; Toivola,Sami
> Subject: Re: Has anyone installed the scala kernel for Jupyter notebook
>
> One option would be to use Apache Toree. A quick setup guide can be
> found here
> https://toree.incubator.apache.org/documentation/user/quick-start
>
> On Wed, Sep 21, 2016 at 2:02 PM, Arif,Mubaraka 
> wrote:
> > Has anyone installed the scala kernel for Jupyter notebook.
> >
> >
> >
> > Any blogs or steps to follow are appreciated.
> >
> >
> >
> > thanks,
> >
> > Muby
> >
>
> --
andy


Re: Using Zeppelin with Spark FP

2016-09-11 Thread andy petrella
Heya, probably worth giving the Spark Notebook a go then.
It can plot any scala data (collection, rdd, df, ds, custom, ...); all are
reactive so they can deal with any sort of incoming data. You can ask on the
gitter (https://gitter.im/andypetrella/spark-notebook) if you like.

hth
cheers

On Sun, Sep 11, 2016 at 11:12 PM Mich Talebzadeh 
wrote:

> Hi,
>
> Zeppelin is getting better.
>
> In its description it says:
>
> [image: image.png]
>
> So far so good. One feature that I have not managed to work out is
> creating plots with Spark functional programming. I can get SQL going by
> connecting to Spark thrift server and you can plot the results
>
> [image: image.png]
>
> However, if I write that using functional programming I won't be able to
> plot it; the plot feature is not available.
>
> Is this correct, or am I missing something?
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
-- 
andy


Re: Scala Vs Python

2016-09-02 Thread andy petrella
looking at the examples, indeed they make nonsense :D

On Fri, 2 Sep 2016 16:48 Mich Talebzadeh,  wrote:

> Right so. We are back into religious arguments. Best of luck
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 2 September 2016 at 15:35, Nicholas Chammas  > wrote:
>
>> On Fri, Sep 2, 2016 at 3:58 AM Mich Talebzadeh 
>> wrote:
>>
>>> I believe as we progress in time Spark is going to move away from
>>> Python. If you look at 2014 Databricks code examples, they were mostly
>>> in Python. Now they are mostly in Scala for a reason.
>>>
>>
>> That's complete nonsense.
>>
>> First off, you can find dozens and dozens of Python code examples here:
>> https://github.com/apache/spark/tree/master/examples/src/main/python
>>
>> The Python API was added to Spark in 0.7.0
>> , back in
>> February of 2013, before Spark was even accepted into the Apache incubator.
>> Since then it's undergone major and continuous development. Though it does
>> lag behind the Scala API in some areas, it's a first-class language and
>> bringing it up to parity with Scala is an explicit project goal. A quick
>> example off the top of my head is all the work that's going into model
>> import/export for Python: SPARK-11939
>> 
>>
>> Additionally, according to the 2015 Spark Survey
>> ,
>> 58% of Spark users use the Python API, more than any other language save
>> for Scala (71%). (Users can select multiple languages on the survey.)
>> Python users were also the 3rd-fastest growing "demographic" for Spark,
>> after Windows and Spark Streaming users.
>>
>> Any notion that Spark is going to "move away from Python" is completely
>> contradicted by the facts.
>>
>> Nick
>>
>>
> --
andy


Re: spark and plot data

2016-07-23 Thread andy petrella
Heya,

Might be worth checking the spark-notebook  I
guess, it offers custom and reactive dynamic charts (scatter, line, bar,
pie, graph, radar, parallel, pivot, …) for any kind of data from an
intuitive and easy Scala API (with server side, incl. spark based, sampling
if needed).

There are many charts available natively, you can check this repo
 (specially the
notebook named Why Spark Notebook) and if you’re familiar with docker, you
can even simply do the following (and use spark 2.0)

docker pull datafellas/scala-for-data-science:1.0-spark2
docker run --rm -it --net=host -m 8g \
  datafellas/scala-for-data-science:1.0-spark2 bash



For any question, you can poke the community live on our gitter
(https://gitter.im/andypetrella/spark-notebook)
or from github (https://github.com/andypetrella/spark-notebook) of course.
HTH
andy

On Sat, Jul 23, 2016 at 11:26 AM Gourav Sengupta 
wrote:

Hi Pedro,
>
> Toree is Scala kernel for Jupyter in case anyone needs a short intro. I
> use it regularly (when I am not using IntelliJ) and its quite good.
>
> Regards,
> Gourav
>
> On Fri, Jul 22, 2016 at 11:15 PM, Pedro Rodriguez  > wrote:
>
>> As of the most recent 0.6.0 release it's partially alleviated, but still
>> not great (compared to something like Jupyter).
>>
>> They can be "downloaded" but it's only really meaningful in importing it
>> back to Zeppelin. It would be great if they could be exported as HTML or
>> PDF, but at present they can't be. I know they have some sort of git
>> support, but it was never clear to me how it was supposed to be used since
>> the docs are sparse on that. So far what works best for us is S3 storage,
>> but you don't get the benefit of Github using that (history + commits etc).
>>
>> There are a couple of other notebooks floating around; Apache Toree seems
>> the most promising for portability since it's based on Jupyter
>> https://github.com/apache/incubator-toree
>>
>> On Fri, Jul 22, 2016 at 3:53 PM, Gourav Sengupta <
>> gourav.sengu...@gmail.com> wrote:
>>
>>> The biggest stumbling block to using Zeppelin has been that we cannot
>>> download the notebooks, cannot export them and certainly cannot sync them
>>> back to Github, without mind numbing and sometimes irritating hacks. Have
>>> those issues been resolved?
>>>
>>>
>>> Regards,
>>> Gourav
>>>
>>>
>>> On Fri, Jul 22, 2016 at 2:22 PM, Pedro Rodriguez <
>>> ski.rodrig...@gmail.com> wrote:
>>>
 Zeppelin works great. The other thing that we have done in notebooks
 (like Zeppelin or Databricks) which support multiple types of spark session
 is register Spark SQL temp tables in our scala code then escape hatch to
 python for plotting with seaborn/matplotlib when the built in plots are
 insufficient.

 —
 Pedro Rodriguez
 PhD Student in Large-Scale Machine Learning | CU Boulder
 Systems Oriented Data Scientist
 UC Berkeley AMPLab Alumni

 pedrorodriguez.io | 909-353-4423
 github.com/EntilZha | LinkedIn
 

 On July 22, 2016 at 3:04:48 AM, Marco Colombo (
 ing.marco.colo...@gmail.com) wrote:

 Take a look at zeppelin

 http://zeppelin.apache.org

 On Thursday, 21 July 2016, Andy Davidson 
 wrote:

> Hi Pseudo
>
> Plotting, graphing, data visualization, report generation are common
> needs in scientific and enterprise computing.
>
> Can you tell me more about your use case? What is it about the current
> process / workflow do you think could be improved by pushing plotting (I
> assume you mean plotting and graphing) into spark.
>
>
> In my personal work all the graphing is done in the driver on summary
> stats calculated using spark. So for me using standard python libs has not
> been a problem.
>
> Andy
>
> From: pseudo oduesp 
> Date: Thursday, July 21, 2016 at 8:30 AM
> To: "user @spark" 
> Subject: spark and plot data
>
> Hi,
> I know Spark is an engine to compute large data sets, but for me I work
> with PySpark and it's a very wonderful machine.
>
> My question: we don't have tools for plotting data, so each time we have
> to switch and go back to Python to use plot.
> But when you have a large result, a scatter plot or a ROC curve, you can't
> use collect to take the data.
>
> Does someone have a proposition for plotting?
>
> thanks
>
>

 --
 Ing. Marco Colombo


>>>
>>
>>
>> --
>> Pedro Rodriguez
>> PhD Student in Distributed Machine Learning | CU Boulder
>> UC Berkeley AMPLab Alumni
>>
>> 

Re: Spark 2.0 release date

2016-06-15 Thread andy petrella
Yeah well... the prior was high... but don't have enough data on Mich to
have an accurate likelihood :-)
But ok, my bad, I continue with the preview stuff and leave this thread in
peace ^^
tx ted
cheers

On Wed, Jun 15, 2016 at 4:47 PM Ted Yu <yuzhih...@gmail.com> wrote:

> Andy:
> You should sense the tone in Mich's response.
>
> To my knowledge, there hasn't been an RC for the 2.0 release yet.
> Once we have an RC, it goes through the normal voting process.
>
> FYI
>
> On Wed, Jun 15, 2016 at 7:38 AM, andy petrella <andy.petre...@gmail.com>
> wrote:
>
>> > tomorrow lunch time
>> Which TZ :-) → I'm working on the update of some materials that Dean
>> Wampler and myself will give tomorrow at Scala Days
>> <http://event.scaladays.org/scaladays-berlin-2016#!%23schedulePopupExtras-7601>
>>  (well
>> tomorrow CEST).
>>
>> Hence, I'm upgrading the materials on spark 2.0.0-preview, do you think
>> 2.0.0 will be released before 6PM CEST (9AM PDT)? I don't want to be a joke
>> in front of the audience with my almost cutting edge version :-P
>>
>> tx
>>
>>
>> On Wed, Jun 15, 2016 at 3:59 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Tomorrow lunchtime.
>>>
>>> Btw can you stop spamming every big data forum about good interview
>>> questions book for big data!
>>>
>>> I have seen your mails on this big data book in spark, hive and tez
>>> forums and I am sure there are many others. That seems to be the only mail
>>> you send around.
>>>
>>> This forum is for technical discussions not for promotional material.
>>> Please confine yourself to technical matters
>>>
>>>
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 15 June 2016 at 12:45, Chaturvedi Chola <chaturvedich...@gmail.com>
>>> wrote:
>>>
>>>> when is the spark 2.0 release planned
>>>>
>>>
>>> --
>> andy
>>
>
> --
andy


Re: Spark 2.0 release date

2016-06-15 Thread andy petrella
> tomorrow lunch time
Which TZ :-) → I'm working on the update of some materials that Dean
Wampler and myself will give tomorrow at Scala Days
<http://event.scaladays.org/scaladays-berlin-2016#!%23schedulePopupExtras-7601>
(well tomorrow CEST).

Hence, I'm upgrading the materials on spark 2.0.0-preview, do you think
2.0.0 will be released before 6PM CEST (9AM PDT)? I don't want to be a joke
in front of the audience with my almost cutting edge version :-P

tx


On Wed, Jun 15, 2016 at 3:59 PM Mich Talebzadeh 
wrote:

> Tomorrow lunchtime.
>
> Btw can you stop spamming every big data forum about good interview
> questions book for big data!
>
> I have seen your mails on this big data book in spark, hive and tez forums
> and I am sure there are many others. That seems to be the only mail you
> send around.
>
> This forum is for technical discussions not for promotional material.
> Please confine yourself to technical matters
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 15 June 2016 at 12:45, Chaturvedi Chola 
> wrote:
>
>> when is the spark 2.0 release planned
>>
>
> --
andy


Re: [Spark 2.0.0] Structured Stream on Kafka

2016-06-14 Thread andy petrella
kool, voted and watched!
tx

On Tue, Jun 14, 2016 at 4:44 PM Cody Koeninger <c...@koeninger.org> wrote:

> I haven't done any significant work on using structured streaming with
> kafka, there's a jira ticket for tracking purposes
>
> https://issues.apache.org/jira/browse/SPARK-15406
>
>
>
> On Tue, Jun 14, 2016 at 9:21 AM, andy petrella <andy.petre...@gmail.com>
> wrote:
> > Heya folks,
> >
> > Just wondering if there are some doc regarding using kafka directly from
> the
> > reader.stream?
> > Has it been integrated already (I mean the source)?
> >
> > Sorry if the answer is RTFM (but then I'd appreciate a pointer anyway^^)
> >
> > thanks,
> > cheers
> > andy
> > --
> > andy
>
-- 
andy


[Spark 2.0.0] Structured Stream on Kafka

2016-06-14 Thread andy petrella
Heya folks,

Just wondering if there are some doc regarding using kafka directly from
the reader.stream?
Has it been integrated already (I mean the source)?

Sorry if the answer is RTFM (but then I'd appreciate a pointer anyway^^)

thanks,
cheers
andy
-- 
andy


Re: Apache Flink

2016-04-17 Thread andy petrella
Just adding one thing to the mix: `that the latency for streaming data is
eliminated` is insane :-D

On Sun, Apr 17, 2016 at 12:19 PM Mich Talebzadeh 
wrote:

>  It seems that Flink argues that the latency for streaming data is
> eliminated whereas with Spark RDD there is this latency.
>
> I noticed that Flink does not support interactive shell much like Spark
> shell where you can add jars to it to do kafka testing. The advice was to
> add the streaming Kafka jar file to CLASSPATH but that does not work.
>
> Most Flink documentation is also rather sparse, with the usual example of
> word count, which is not exactly what you want.
>
> Anyway I will have a look at it further. I have a Spark Scala streaming
> Kafka program that works fine in Spark and I want to recode it using Scala
> for Flink with Kafka but have difficulty importing and testing libraries.
>
> Cheers
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 17 April 2016 at 02:41, Ascot Moss  wrote:
>
>> I compared both last month, seems to me that Flink's MLLib is not yet
>> ready.
>>
>> On Sun, Apr 17, 2016 at 12:23 AM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Thanks Ted. I was wondering if someone is using both :)
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 16 April 2016 at 17:08, Ted Yu  wrote:
>>>
 Looks like this question is more relevant on flink mailing list :-)

 On Sat, Apr 16, 2016 at 8:52 AM, Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> Hi,
>
> Has anyone used Apache Flink instead of Spark by any chance
>
> I am interested in its set of libraries for Complex Event Processing.
>
> Frankly I don't know if it offers far more than Spark offers.
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>


>>>
>>
> --
andy


Re: Scala from Jupyter

2016-02-16 Thread andy petrella
That's great man.

Note, if you don't want to struggle too much, you just have to download the
version you need here: http://spark-notebook.io/
It can be whatever you want: zip, tgz, docker or even deb ^^. So pick the
flavor you like the most.

I'd recommend you two things:
1. build on master to have access to coool features including the variables
panels (a la RStudio)
2. go over the existing examples (around 40 atm), which include dynamic
charts for streaming, graph, geospatial, machine learning, etc.
Some are funny :-P (the US History for instance). Oh, the charts are all
scala based but there are some implicits that will save your life coz they
will render what you were thinking of without asking...

Have fun!
cheers,



On Tue, Feb 16, 2016 at 1:43 PM Aleksandr Modestov <
aleksandrmodes...@gmail.com> wrote:

> Thank you!
> I will test Spark Notebook.
>
> On Tue, Feb 16, 2016 at 3:37 PM, andy petrella <andy.petre...@gmail.com>
> wrote:
>
>> Hello Alex!
>>
>> Rajeev is right, come over the spark notebook gitter room, you'll be
>> helped by many experienced people if you have some troubles:
>> https://gitter.im/andypetrella/spark-notebook
>>
>> The spark notebook has many integrated, reactive (scala) and extendable
>> (scala) plotting capabilities.
>>
>> cheers and have fun!
>> andy
>>
>> On Tue, Feb 16, 2016 at 1:04 PM Rajeev Reddy <rajeev.redd...@gmail.com>
>> wrote:
>>
>>> Hello,
>>>
>>> Let me understand your query correctly.
>>>
>>> Case 1. You have a jupyter installation for python and you want to use
>>> it for scala.
>>> Solution: You can install kernels other than python Ref
>>> <https://github.com/ipython/ipython/wiki/IPython-kernels-for-other-languages>
>>>
>>> Case 2. You want to use spark scala
>>> Solution: You can create a notebook config in which you create spark
>>> context and inject it back to your notebook OR Install other kind of
>>> notebooks like spark-notebook
>>> <https://github.com/andypetrella/spark-notebook/> or apache zeppelin
>>> <https://zeppelin.incubator.apache.org/>
>>>
>>>
>>> According to my experience for case 2. I have been using and prefer
>>> spark notebook over zeppelin
>>>
>>>
>>> On Tue, Feb 16, 2016 at 4:49 PM, AlexModestov <
>>> aleksandrmodes...@gmail.com> wrote:
>>>
>>>> Hello!
>>>> I want to use Scala from Jupyter (or may be something else if you could
>>>> recomend anything. I mean an IDE). Does anyone know how I can do this?
>>>> Thank you!
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Scala-from-Jupyter-tp26234.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>> -
>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks,
>>> Rajeev Reddy
>>> Software Development Engineer  1
>>> IXP - Information Intelligence (I2) Team
>>> Flipkart Internet Pvt. Ltd (flipkart.com)
>>> http://rajeev-reddy.com
>>> +91-8001618957
>>>
>> --
>> andy
>>
>
> --
andy


Re: Scala from Jupyter

2016-02-16 Thread andy petrella
Hello Alex!

Rajeev is right, come over the spark notebook gitter room, you'll be helped
by many experienced people if you have some troubles:
https://gitter.im/andypetrella/spark-notebook

The spark notebook has many integrated, reactive (scala) and extendable
(scala) plotting capabilities.

cheers and have fun!
andy

On Tue, Feb 16, 2016 at 1:04 PM Rajeev Reddy 
wrote:

> Hello,
>
> Let me understand your query correctly.
>
> Case 1. You have a jupyter installation for python and you want to use it
> for scala.
> Solution: You can install kernels other than python Ref
> 
>
> Case 2. You want to use spark scala
> Solution: You can create a notebook config in which you create spark
> context and inject it back to your notebook OR Install other kind of
> notebooks like spark-notebook
>  or apache zeppelin
> 
>
>
> According to my experience for case 2. I have been using and prefer spark
> notebook over zeppelin
>
>
> On Tue, Feb 16, 2016 at 4:49 PM, AlexModestov  > wrote:
>
>> Hello!
>> I want to use Scala from Jupyter (or may be something else if you could
>> recomend anything. I mean an IDE). Does anyone know how I can do this?
>> Thank you!
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Scala-from-Jupyter-tp26234.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
>
> --
> Thanks,
> Rajeev Reddy
> Software Development Engineer  1
> IXP - Information Intelligence (I2) Team
> Flipkart Internet Pvt. Ltd (flipkart.com)
> http://rajeev-reddy.com
> +91-8001618957
>
-- 
andy


Re: Graph visualization tool for GraphX

2015-12-08 Thread andy petrella
Hello Lin,

This is indeed a tough scenario when you have many vertices and (even worse)
many edges...

So two-fold answer:
First, technically, there is graph plotting support in the spark notebook
(https://github.com/andypetrella/spark-notebook/ → check this notebook:
https://github.com/andypetrella/spark-notebook/blob/master/notebooks/viz/Graph%20Plots.snb).
You can plot graphs from scala, which will convert to D3 with a force-layout
force field.
The points which you will plot are "sampled" using a `Sampler` that you can
provide yourself, which leads to the second fold of this answer.

Plotting a large graph is rather tough because there is no real notion of
dimension... there is always the option to dig into topological analysis
theory to find a good homeomorphism... but it won't be that efficient ;-D.
Best is to find a good approach to generalize/summarize the information;
there are many, many techniques (found mainly in geospatial viz and biology
viz theories...).
Best is to check what will match your need the fastest.
There are quick techniques like using unsupervised clustering models and then
plotting a Voronoi diagram (which can be approached using force layout).

In general terms I might say that multiscaling is intuitively what you want
first: this is an interesting paper presenting the foundations:
https://www.cs.ubc.ca/~tmm/courses/533-07/readings/auberIV03Seattle.pdf
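
As a crude first step before any of those techniques, you can also simply
sample the graph down in GraphX itself. A rough sketch of the idea (this is
not the notebook's `Sampler` API, just plain GraphX, and it assumes the kept
vertex sample is small enough to collect on the driver):

```
import org.apache.spark.graphx.Graph

// Keep a fraction of the vertices, and only the edges between kept vertices.
def sampleForPlot[VD, ED](graph: Graph[VD, ED], fraction: Double,
                          seed: Long = 42L): Graph[VD, ED] = {
  val keptIds = graph.vertices
    .sample(withReplacement = false, fraction, seed)
    .map(_._1)
    .collect()
    .toSet
  graph.subgraph(
    epred = t => keptIds.contains(t.srcId) && keptIds.contains(t.dstId),
    vpred = (id, _) => keptIds.contains(id))
}
```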

Oh and BTW, to end this longish mail, while looking for new papers on that,
I stumbled upon this one:
http://vacommunity.org/egas2015/papers/IGAS2015-ScottLangevin.pdf which
is using
1. *Spark !!!*
2. a tile based approach (~ to tiling + pyramids in geospatial)

HTH

PS regarding the Spark Notebook, you can always come and discuss on gitter:
https://gitter.im/andypetrella/spark-notebook


On Tue, Dec 8, 2015 at 6:30 PM Lin, Hao  wrote:

> Hello Jorn,
>
>
>
> Thank you for the reply and being tolerant of my over simplified question.
> I should’ve been more specific.  Though ~TB of data, there will be about
> billions of records (edges) and 100,000 nodes. We need to visualize the
> social networks graph like what can be done by Gephi which has limitation
> on scalability to handle such amount of data. There will be dozens of users
> to access and the response time is also critical.  We would like to run the
> visualization tool on the remote ec2 server where webtool can be a good
> choice for us.
>
>
>
> Please let me know if I need to be more specific J.  Thanks
>
> hao
>
>
>
> *From:* Jörn Franke [mailto:jornfra...@gmail.com]
> *Sent:* Tuesday, December 08, 2015 11:31 AM
> *To:* Lin, Hao
> *Cc:* user@spark.apache.org
> *Subject:* Re: Graph visualization tool for GraphX
>
>
>
> I am not sure about your use case. How should a human interpret many
> terabytes of data in one large visualization?? You have to be more
> specific, what part of the data needs to be visualized, what kind of
> visualization, what navigation do you expect within the visualisation, how
> many users, response time, web tool vs mobile vs Desktop etc
>
>
> On 08 Dec 2015, at 16:46, Lin, Hao  wrote:
>
> Hi,
>
>
>
> Anyone can recommend a great Graph visualization tool for GraphX  that can
> handle truly large Data (~ TB) ?
>
>
>
> Thanks so much
>
> Hao
>
> Confidentiality Notice:: This email, including attachments, may include
> non-public, proprietary, confidential or legally privileged information. If
> you are not an intended recipient or an authorized agent of an intended
> recipient, you are hereby notified that any dissemination, distribution or
> copying of the information contained in or transmitted with this e-mail is
> unauthorized and strictly prohibited. If you have received this email in
> error, please notify the sender by replying to this message and permanently
> delete this e-mail, its attachments, and any copies of it immediately. You
> should not retain, copy or use this e-mail or any attachment for any
> purpose, nor disclose all or any part of the contents to any other person.
> Thank you.
>
-- 
andy


Re: StructType for oracle.sql.STRUCT

2015-11-28 Thread andy petrella
Warf... such a heavy task, man!

I'd love to follow your work on that (I've a long XP in geospatial too), is
there a repo available already for that?

The hard part will be to support all descendant types I guess (line,
mutlilines, and so on), then creating the spatial operators.

The only project I know that has this kind of aim (although it's very
limited and simple atm) is magellan: https://github.com/harsha2010/magellan

andy


On Sat, Nov 28, 2015 at 9:15 PM Pieter Minnaar 
wrote:

> Hi,
>
> I need to read Oracle Spatial SDO_GEOMETRY tables into Spark.
>
> I need to know how to create a StructField to use in the schema definition
> for the Oracle geometry columns. In the standard JDBC the values are read
> as oracle.sql.STRUCT types.
>
> How can I get the same values in Spark SQL?
>
> Regards,
> Pieter
>
-- 
andy


Re: Spark + Jupyter (IPython Notebook)

2015-08-18 Thread andy petrella
Hey,

Actually, for Scala, I'd rather suggest using
https://github.com/andypetrella/spark-notebook/

It's deployed at several places like *Alibaba*, *EBI*, *Cray* and is
supported by both the Scala community and the company Data Fellas.
For instance, it was part of the Big Scala Pipeline training given this
16th August at Galvanize in San Francisco with the collaboration of *Datastax,
Mesosphere, Databricks, Confluent and Typesafe*:
http://scala.bythebay.io/pipeline.html. It was a successful training day with
100+ attendees.

Also, it's the only one fully reactive including a reactive plotting
library in Scala, allowing you to creatively plot a moving average computed
in a DStream, or a D3 Graph layout dynamically updated or even a dynamic
map of the received tweets having geoloc set. Of course, you can plot
lines, pies, bars, hist, boxplot for any kind of data, being Dataframe, SQL
stuffs, Seq, List, Map or whatever of tuples or classes.

Checkout http://spark-notebook.io/, for your specific distro.
Note that you can also use it directly on DCOS.

For any question, I'll be glad helping you on the ~200 crowded gitter
chatroom: https://gitter.im/andypetrella/spark-notebook

cheers and have fun :-)


On Tue, Aug 18, 2015 at 10:24 PM Guru Medasani gdm...@gmail.com wrote:

 For python it is really great.

 There is some work in progress in bringing Scala support to Jupyter as
 well.

 https://github.com/hohonuuli/sparknotebook

 https://github.com/alexarchambault/jupyter-scala


 Guru Medasani
 gdm...@gmail.com



 On Aug 18, 2015, at 12:29 PM, Jerry Lam chiling...@gmail.com wrote:

 Hi Guru,

 Thanks! Great to hear that someone tried it in production. How do you like
 it so far?

 Best Regards,

 Jerry


 On Tue, Aug 18, 2015 at 11:38 AM, Guru Medasani gdm...@gmail.com wrote:

 Hi Jerry,

 Yes. I’ve seen customers using this in production for data science work.
 I’m currently using this for one of my projects on a cluster as well.

 Also, here is a blog that describes how to configure this.


 http://blog.cloudera.com/blog/2014/08/how-to-use-ipython-notebook-with-apache-spark/


 Guru Medasani
 gdm...@gmail.com



 On Aug 18, 2015, at 8:35 AM, Jerry Lam chiling...@gmail.com wrote:

 Hi spark users and developers,

 Did anyone have IPython Notebook (Jupyter) deployed in production that
 uses Spark as the computational engine?

 I know Databricks Cloud provides similar features with deeper integration
 with Spark. However, Databricks Cloud has to be hosted by Databricks so we
 cannot do this.

 Other solutions (e.g. Zeppelin) seem to reinvent the wheel that IPython
 has already offered years ago. It would be great if someone can educate me
 the reason behind this.

 Best Regards,

 Jerry




 --
andy


Re: Spark is in-memory processing, how then can Tachyon make Spark faster?

2015-08-07 Thread andy petrella
Exactly!

The sharing part is used in the Spark Notebook (this one
https://github.com/andypetrella/spark-notebook/blob/master/notebooks/Tachyon%20Test.snb)
so we can share stuff between notebooks, which are different SparkContexts
(in different JVMs).
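
The sharing pattern itself is roughly this (a sketch only: the tachyon://
host/port and paths are placeholders, and it assumes the Tachyon client is on
the classpath of both contexts):

```
// Notebook / JVM 1: persist an expensive result into Tachyon.
val cube = sc.textFile("hdfs:///genomes/raw")
  .filter(_.nonEmpty)   // any expensive pipeline here
cube.saveAsTextFile("tachyon://tachyon-master:19998/shared/genomes-cube")

// Notebook / JVM 2 (a different SparkContext): read the same data back.
val shared = sc.textFile("tachyon://tachyon-master:19998/shared/genomes-cube")
```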

OTOH, we have a project that creates micro services on genomics data; for
several reasons we used Tachyon to serve genome cubes (ranges across
genomes), see here: https://github.com/med-at-scale/high-health.

HTH
andy

On Fri, Aug 7, 2015 at 8:36 PM Calvin Jia jia.cal...@gmail.com wrote:

 Hi,

 Tachyon http://tachyon-project.org manages memory off heap which can
 help prevent long GC pauses. Also, using Tachyon will allow the data to be
 shared between Spark jobs if they use the same dataset.

 Here's http://www.meetup.com/Tachyon/events/222485713/ a production use
 case where Baidu runs Tachyon to get 30x performance improvement in their
 SparkSQL workload.

 Hope this helps,
 Calvin

 On Fri, Aug 7, 2015 at 9:42 AM, Muler mulugeta.abe...@gmail.com wrote:

 Spark is an in-memory engine and attempts to do computation in-memory.
 Tachyon is memory-centric distributed storage, OK, but how would that help
 run Spark faster?


 --
andy


Re: Re: Real-time data visualization with Zeppelin

2015-08-06 Thread andy petrella
Yep, most of the things will work just by renaming it :-D
You can even use nbconvert afterwards


On Thu, Aug 6, 2015 at 12:09 PM jun kit...@126.com wrote:

 Hi andy,

 Is there any method to convert ipython notebook file(.ipynb) to spark
 notebook file(.snb) or vice versa?

 BR
 Jun

 At 2015-07-13 02:45:57, andy petrella andy.petre...@gmail.com wrote:

 Heya,

 You might be looking for something like this I guess:
 https://www.youtube.com/watch?v=kB4kRQRFAVc.

 The Spark-Notebook (https://github.com/andypetrella/spark-notebook/) can
 bring that to you actually, it uses fully reactive bilateral communication
 streams to update data and viz, plus it hides almost everything for you ^^.
 The video was using the notebook notebooks/streaming/Twitter stream.snb
 https://github.com/andypetrella/spark-notebook/blob/master/notebooks/streaming/Twitter%20stream.snb
  so
 you can play it yourself if you like.

 You might want to build the master (before 0.6.0 is released → soon); see
 http://spark-notebook.io/.

 HTH
 andy



 On Sun, Jul 12, 2015 at 8:29 PM Ruslan Dautkhanov dautkha...@gmail.com
 wrote:

 Don't think it is a Zeppelin problem.. RDDs are immutable.
 Unless you integrate something like IndexedRDD
 http://spark-packages.org/package/amplab/spark-indexedrdd
 into Zeppelin I think it's not possible.


 --
 Ruslan Dautkhanov

 On Wed, Jul 8, 2015 at 3:24 PM, Brandon White bwwintheho...@gmail.com
 wrote:

 Can you use a cron job to update it every X minutes?

 On Wed, Jul 8, 2015 at 2:23 PM, Ganelin, Ilya 
 ilya.gane...@capitalone.com wrote:

 Hi all – I’m just wondering if anyone has had success integrating Spark
 Streaming with Zeppelin and actually dynamically updating the data in near
 real-time. From my investigation, it seems that Zeppelin will only allow
 you to display a snapshot of data, not a continuously updating table. Has
 anyone figured out if there’s a way to loop a display command or how to
 provide a mechanism to continuously update visualizations?

 Thank you,
 Ilya Ganelin

 [image: 2DD951D6-FF99-4415-80AA-E30EFE7CF452[4].png]

 --

 The information contained in this e-mail is confidential and/or
 proprietary to Capital One and/or its affiliates and may only be used
 solely in performance of work or services for Capital One. The information
 transmitted herewith is intended only for use by the individual or entity
 to which it is addressed. If the reader of this message is not the intended
 recipient, you are hereby notified that any review, retransmission,
 dissemination, distribution, copying or other use of, or taking of any
 action in reliance upon this information is strictly prohibited. If you
 have received this communication in error, please contact the sender and
 delete the material from your computer.



 --
andy


Re: Real-time data visualization with Zeppelin

2015-07-12 Thread andy petrella
Heya,

You might be looking for something like this I guess:
https://www.youtube.com/watch?v=kB4kRQRFAVc.

The Spark-Notebook (https://github.com/andypetrella/spark-notebook/) can
bring that to you actually, it uses fully reactive bilateral communication
streams to update data and viz, plus it hides almost everything for you ^^.
The video was using the notebook notebooks/streaming/Twitter stream.snb
https://github.com/andypetrella/spark-notebook/blob/master/notebooks/streaming/Twitter%20stream.snb
so
you can play it yourself if you like.

You might want to build the master (before 0.6.0 is released → soon); see
http://spark-notebook.io/.

HTH
andy



On Sun, Jul 12, 2015 at 8:29 PM Ruslan Dautkhanov dautkha...@gmail.com
wrote:

 Don't think it is a Zeppelin problem.. RDDs are immutable.
 Unless you integrate something like IndexedRDD
 http://spark-packages.org/package/amplab/spark-indexedrdd
 into Zeppelin I think it's not possible.


 --
 Ruslan Dautkhanov

 On Wed, Jul 8, 2015 at 3:24 PM, Brandon White bwwintheho...@gmail.com
 wrote:

 Can you use a cron job to update it every X minutes?

 On Wed, Jul 8, 2015 at 2:23 PM, Ganelin, Ilya 
 ilya.gane...@capitalone.com wrote:

 Hi all – I’m just wondering if anyone has had success integrating Spark
 Streaming with Zeppelin and actually dynamically updating the data in near
 real-time. From my investigation, it seems that Zeppelin will only allow
 you to display a snapshot of data, not a continuously updating table. Has
 anyone figured out if there’s a way to loop a display command or how to
 provide a mechanism to continuously update visualizations?

 Thank you,
 Ilya Ganelin

 [image: 2DD951D6-FF99-4415-80AA-E30EFE7CF452[4].png]

 --

 The information contained in this e-mail is confidential and/or
 proprietary to Capital One and/or its affiliates and may only be used
 solely in performance of work or services for Capital One. The information
 transmitted herewith is intended only for use by the individual or entity
 to which it is addressed. If the reader of this message is not the intended
 recipient, you are hereby notified that any review, retransmission,
 dissemination, distribution, copying or other use of, or taking of any
 action in reliance upon this information is strictly prohibited. If you
 have received this communication in error, please contact the sender and
 delete the material from your computer.






Re: Machine Learning on GraphX

2015-06-18 Thread andy petrella
I guess that belief propagation could help here (at least, I find the ideas
similar enough), thus this article might be a good start:
http://arxiv.org/pdf/1004.1003.pdf
(it's on my todo list, hence cannot really help further ^^)
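
For what it's worth, a rough sketch of the bipartite user/movie graph such an
approach would operate on (plain GraphX; ids offset so users and movies don't
collide, and the latent-factor vertex attributes are only initialized — this
is just the graph construction, not the ALS/BP itself):

```
import org.apache.spark.graphx.{Edge, Graph}

case class Rating(userId: Long, movieId: Long, rating: Double)

val ratings = sc.parallelize(Seq(Rating(1L, 10L, 4.0), Rating(2L, 10L, 3.5)))
val movieOffset = 1000000L  // assumes user ids stay below this value

// Users and movies become vertices, ratings become edge attributes.
val edges = ratings.map(r => Edge(r.userId, movieOffset + r.movieId, r.rating))
val graph: Graph[Array[Double], Double] =
  Graph.fromEdges(edges, Array.fill(10)(0.1))  // 10 latent factors per vertex
```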

On Thu, Jun 18, 2015 at 11:44 AM Timothée Rebours t.rebo...@gmail.com
wrote:

 Thanks for the quick answer.
 I've already followed this tutorial but it doesn't use GraphX at all. My
 goal would be to work directly on the graph, and not extracting edges and
 vertices from the graph as standard RDDs and then work on that with the
 standard MLlib's ALS, which has no interest. That's why I tried with the
 other implementation, but it's not optimized at all.

 I might have gone in the wrong direction with the ALS, but I'd like to see
 what's possible to do with MLlib on GraphX. Any idea ?

 2015-06-18 11:19 GMT+02:00 Akhil Das ak...@sigmoidanalytics.com:

 This might give you a good start
 http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html
 it's a bit old though.

 Thanks
 Best Regards

 On Thu, Jun 18, 2015 at 2:33 PM, texol t.rebo...@gmail.com wrote:

 Hi,

 I'm new to GraphX and I'd like to use Machine Learning algorithms on top
 of
 it. I wanted to write a simple program implementing MLlib's ALS on a
 bipartite graph (a simple movie recommendation), but didn't succeed. I
 found
 an implementation on Spark 1.1.x
 (
 https://github.com/ankurdave/spark/blob/GraphXALS/graphx/src/main/scala/org/apache/spark/graphx/lib/ALS.scala
 )
 of ALS on GraphX, but it is painfully slow compared to the standard
 implementation, and uses the deprecated (in the current version)
 PregelVertex class.
 Do we expect a new implementation ? Is there a smarter solution to do so
 ?

 Thanks,
 Regards,
 Timothée Rebours.




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Machine-Learning-on-GraphX-tp23388.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





 --
 Timothée Rebours
 13, rue Georges Bizet
 78380 BOUGIVAL



Re: [Streaming] Configure executor logging on Mesos

2015-05-30 Thread andy petrella
Hello,

I'm currently exploring DCOS for the spark notebook, and while looking at
the spark configuration I found something interesting which is actually
converging to what we've discovered:
https://github.com/mesosphere/universe/blob/master/repo/packages/S/spark/0/marathon.json

So the logging is working fine here because the spark package is using the
spark-class which is able to configure the log4j file. But the interesting
part comes with the fact that the `uris` parameter is filled in with a
downloadable path to the log4j file!

However, it's not possible when creating the spark context ourselves and
relying on the mesos scheduler backend only, unless the spark.executor.uri
(or another one) can take more than one downloadable path.

my.2¢

andy

On Fri, May 29, 2015 at 5:09 PM Gerard Maas gerard.m...@gmail.com wrote:

 Hi Tim,

 Thanks for the info.   We (Andy Petrella and myself) have been diving a
 bit deeper into this log config:

 The log line I was referring to is this one (sorry, I provided the others
 just for context)

 *Using Spark's default log4j profile:
 org/apache/spark/log4j-defaults.properties*

 That line comes from Logging.scala [1] where a default config is loaded if
 none is found in the classpath upon the startup of the Spark Mesos executor
 in the Mesos sandbox. At that point in time, none of the
 application-specific resources have been shipped yet as the executor JVM is
 just starting up.   To load a custom configuration file we should have it
 already on the sandbox before the executor JVM starts and add it to the
 classpath on the startup command. Is that correct?

 For the classpath customization, It looks like it should be possible to
 pass a -Dlog4j.configuration  property by using the
 'spark.executor.extraClassPath' that will be picked up at [2] and that
 should be added to the command that starts the executor JVM, but the
 resource must be already on the host before we can do that. Therefore we
 also need some means of 'shipping' the log4j.configuration file to the
 allocated executor.
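
 For illustration, the kind of settings this boils down to (a sketch only; it
 uses spark.executor.extraJavaOptions for the -Dlog4j.configuration property
 rather than the extraClassPath route above, and every path/URI is a
 placeholder — the log4j.properties file still has to be on the executor host
 or in the sandbox before the JVM starts):

```
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("streaming-job")
  .setMaster("mesos://zk://zk1:2181/mesos")
  // the executor tarball Mesos downloads into the sandbox
  .set("spark.executor.uri", "hdfs:///dist/spark-assembly.tgz")
  // point log4j at a file already available on the executor host
  .set("spark.executor.extraJavaOptions",
       "-Dlog4j.configuration=file:/etc/spark/log4j-executor.properties")
```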

 This all boils down to your statement on the need of shipping extra files
 to the sandbox. Bottom line: It's currently not possible to specify a
 config file for your mesos executor. (ours grows several GB/day).

 The only workaround I found so far is to open up the Spark assembly,
 replace the log4j-default.properties and pack it up again.  That would
 work, although kind of rudimentary as we use the same assembly for many
 jobs.  Probably, accessing the log4j API programmatically should also work
 (I didn't try that yet)

 Should we open a JIRA for this functionality?

 -kr, Gerard.




 [1]
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/Logging.scala#L128
 [2]
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackend.scala#L77

 On Thu, May 28, 2015 at 7:50 PM, Tim Chen t...@mesosphere.io wrote:


 -- Forwarded message --
 From: Tim Chen t...@mesosphere.io
 Date: Thu, May 28, 2015 at 10:49 AM
 Subject: Re: [Streaming] Configure executor logging on Mesos
 To: Gerard Maas gerard.m...@gmail.com


 Hi Gerard,

 The log line you referred to is not Spark logging but Mesos own logging,
 which is using glog.

 Our own executor logs should only contain very few lines though.

 Most of the log lines you'll see is from Spark, and it can be controled
 by specifiying a log4j.properties to be downloaded with your Mesos task.
 Alternatively if you are downloading Spark executor via spark.executor.uri,
 you can include log4j.properties in that tar ball.

 I think we probably need some more configurations for Spark scheduler to
 pick up extra files to be downloaded into the sandbox.

 Tim





 On Thu, May 28, 2015 at 6:46 AM, Gerard Maas gerard.m...@gmail.com
 wrote:

 Hi,

 I'm trying to control the verbosity of the logs on the Mesos executors
 with no luck so far. The default behaviour is INFO on stderr dump with an
 unbounded growth that gets too big at some point.

 I noticed that when the executor is instantiated, it locates a default
 log configuration in the spark assembly:

 I0528 13:36:22.958067 26890 exec.cpp:206] Executor registered on slave
 20150528-063307-780930314-5050-8152-S5
 Spark assembly has been built with Hive, including Datanucleus jars on
 classpath
 Using Spark's default log4j profile:
 org/apache/spark/log4j-defaults.properties

 So, no matter what I provide in my job jar files (or also tried with
 (spark.executor.extraClassPath=log4j.properties) takes effect in the
 executor's configuration.

 How should I configure the log on the executors?

 thanks, Gerard.







Re: Running Javascript from scala spark

2015-05-26 Thread andy petrella
Yop, why not use, like you said, a JS engine like Rhino? But then I would
suggest using mapPartitions instead so there is only one engine per partition.
Probably broadcasting the script is also a good thing to do.

I guess it's for ad hoc transformations passed by a remote client,
otherwise you could simply convert the JS into Scala, right?
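
To make that concrete, here is a minimal sketch of what I mean (assuming a
plain SparkContext `sc` and the JVM's built-in javax.script engine — Rhino on
Java 7, Nashorn on Java 8; the `transform` function and the sample data are
made up for illustration):

```
import javax.script.{Invocable, ScriptEngineManager}

// Broadcast the JS source once; instantiate one engine per partition.
val jsSource = """function transform(s) { return s.toUpperCase(); }"""
val jsBroadcast = sc.broadcast(jsSource)

val input = sc.parallelize(Seq("spark", "and", "javascript"))
val transformed = input.mapPartitions { iter =>
  // One engine per partition, not per record; created on the executor so
  // nothing non-serializable is captured in the closure.
  val engine = new ScriptEngineManager().getEngineByName("JavaScript")
  engine.eval(jsBroadcast.value)
  val invocable = engine.asInstanceOf[Invocable]
  iter.map(s => invocable.invokeFunction("transform", s).toString)
}
transformed.collect()  // Array(SPARK, AND, JAVASCRIPT)
```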

HTH
Andy

On Tue, 26 May 2015 at 21:03, marcos rebelo ole...@gmail.com wrote:

 Hi all

 Let me be clear, I'm speaking of Spark (big data, map/reduce, hadoop, ...
 related). I have multiple map/flatMap/groupBy and one of the steps needs to
 be a map passing the item inside a JavaScript code.

 2 Questions:
  - Is this question related to this list?
  - Did someone do something similar?

 Best Regards
 Marcos Rebelo



 On Tue, May 26, 2015 at 8:03 PM, Marcelo Vanzin van...@cloudera.com
 wrote:

 Is it just me or does that look completely unrelated to
 Spark-the-Apache-project?

 On Tue, May 26, 2015 at 10:55 AM, Ted Yu yuzhih...@gmail.com wrote:

 Have you looked at https://github.com/spark/sparkjs ?

 Cheers

 On Tue, May 26, 2015 at 10:17 AM, marcos rebelo ole...@gmail.com
 wrote:

 Hi all,

 My first message on this mailing list:

 I need to run JavaScript on Spark. Somehow I would like to use the
 ScriptEngineManager or any other way that makes Rhino do the work for me.

 Consider that I have a Structure that needs to be changed by a
 JavaScript. I will have a set of Javascript and depending on the structure
 I will do some calculation.

 Did someone make it work and can get me a simple snippet that works?

 Thanks for any support

 Best Regards
 Marcos Rebelo





 --
 Marcelo





Re: solr in spark

2015-04-28 Thread andy petrella
AFAIK Datastax is heavily looking at it. They have a good integration of
Cassandra with it. The next step was clearly to have a strong combination of
the three in one of the coming releases.

On Tue, 28 Apr 2015 at 18:28, Jeetendra Gangele gangele...@gmail.com
wrote:

 Does anyone tried using solr inside spark?
 below is the project describing it.
 https://github.com/LucidWorks/spark-solr.

 I have a requirement in which I want to index 20 million company names and
 then search as and when new data comes in. The output should be a list of
 companies matching the query.

 Spark has inbuilt elastic search but for this purpose Elastic search is
 not a good option since this is a totally text-search problem?

 Elastic search is good  for filtering and grouping.

 Does any body used solr inside spark?

 Regards
 jeetendra




Re: Spark and accumulo

2015-04-21 Thread andy petrella
Hello Madhvi,

Some work has been done by @pomadchin using the spark notebook, maybe you
should come on https://gitter.im/andypetrella/spark-notebook and poke him?
There are some discoveries he made that might be helpful to know.

Also you can poke @lossyrob from Azavea, he did that for geotrellis

my0.2c
andy


On Tue, Apr 21, 2015 at 9:25 AM Akhil Das ak...@sigmoidanalytics.com
wrote:

 You can simply use a custom InputFormat (AccumuloInputFormat) with the
 hadoop RDDs (sc.newAPIHadoopRDD etc.) for that; all you need to do is to
 pass the jobConfs. Here's a pretty clean discussion:
 http://stackoverflow.com/questions/29244530/how-do-i-create-a-spark-rdd-from-accumulo-1-6-in-spark-notebook#answers-header
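
 In Scala that looks roughly like this (a sketch of the Accumulo 1.5/1.6-era
 API based on the linked answer; instance, zookeeper, user, password and
 table names are placeholders):

```
import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat
import org.apache.accumulo.core.client.security.tokens.PasswordToken
import org.apache.accumulo.core.data.{Key, Value}
import org.apache.accumulo.core.security.Authorizations
import org.apache.hadoop.mapreduce.Job

val job = Job.getInstance()
AccumuloInputFormat.setZooKeeperInstance(job, "accumuloInstance", "zk1:2181")
AccumuloInputFormat.setConnectorInfo(job, "root", new PasswordToken("secret"))
AccumuloInputFormat.setInputTableName(job, "myTable")
AccumuloInputFormat.setScanAuthorizations(job, new Authorizations())

val rdd = sc.newAPIHadoopRDD(job.getConfiguration,
  classOf[AccumuloInputFormat], classOf[Key], classOf[Value])
```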

 Thanks
 Best Regards

 On Tue, Apr 21, 2015 at 9:55 AM, madhvi madhvi.gu...@orkash.com wrote:

 Hi all,

 Is there anything to integrate spark with accumulo or make spark to
 process over accumulo data?

 Thanks
 Madhvi Gupta

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: Spark 1.2.0 with Play/Activator

2015-04-07 Thread andy petrella
Mmmh, you want it running with spark 1.2 and hadoop 2.5.0-cdh5.3.2, right?

If I'm not wrong you might have to launch it like so:
```
sbt -Dspark.version=1.2.0 -Dhadoop.version=2.5.0-cdh5.3.2
```

Or you can download it from http://spark-notebook.io if you want.

HTH
andy



On Tue, Apr 7, 2015 at 9:06 AM Manish Gupta 8 mgupt...@sapient.com wrote:

  If I try to build spark-notebook with spark.version=1.2.0-cdh5.3.0,
 sbt throws these warnings before failing to compile:



 :: org.apache.spark#spark-yarn_2.10;1.2.0-cdh5.3.0: not found

 :: org.apache.spark#spark-repl_2.10;1.2.0-cdh5.3.0: not found



 Any suggestions?



 Thanks



 *From:* Manish Gupta 8 [mailto:mgupt...@sapient.com]
 *Sent:* Tuesday, April 07, 2015 12:04 PM
 *To:* andy petrella; user@spark.apache.org
 *Subject:* RE: Spark 1.2.0 with Play/Activator



 Thanks for the information Andy. I will go through the versions mentioned
 in Dependencies.scala to identify the compatibility.



 Regards,

 Manish





 *From:* andy petrella [mailto:andy.petre...@gmail.com
 andy.petre...@gmail.com]
 *Sent:* Tuesday, April 07, 2015 11:04 AM
 *To:* Manish Gupta 8; user@spark.apache.org
 *Subject:* Re: Spark 1.2.0 with Play/Activator



 Hello Manish,

 you can take a look at the spark-notebook build, it's a bit tricky to get
 rid of some clashes but at least you can refer to this build to have ideas.
 LSS, I have stripped out akka from play deps.

 ref:
 https://github.com/andypetrella/spark-notebook/blob/master/build.sbt

 https://github.com/andypetrella/spark-notebook/blob/master/project/Dependencies.scala

 https://github.com/andypetrella/spark-notebook/blob/master/project/Shared.scala

 HTH, cheers
 andy



 Le mar 7 avr. 2015 07:26, Manish Gupta 8 mgupt...@sapient.com a écrit :

  Hi,



 We are trying to build a Play framework based web application integrated
 with Apache Spark. We are running *Apache Spark 1.2.0 CDH 5.3.0*. But
 struggling with akka version conflicts (errors like
 java.lang.NoSuchMethodError in akka). We have tried Play 2.2.6 as well as
 Activator 1.3.2.



 If anyone has successfully integrated Spark 1.2.0 with Play/Activator,
 please share the version we should use and akka dependencies we should add
 in Build.sbt.



 Thanks,

 Manish




Re: Processing Large Images in Spark?

2015-04-07 Thread andy petrella
Heya,

You might be interested in looking at GeoTrellis.
They use RDDs of Tiles to process big images like Landsat ones can be
(especially Landsat 8).

However, I see you have only 1 GB per file, so I guess you only care about a
single band? Or is it a reboxed pic?

Note: I think the GeoTrellis image format is still single band, although
it's highly optimized for distributed geoprocessing

my2¢
andy


On Tue, Apr 7, 2015 at 12:06 AM Patrick Young 
patrick.mckendree.yo...@gmail.com wrote:

 Hi all,

 I'm new to Spark and wondering if it's appropriate to use for some image
 processing tasks on pretty sizable (~1 GB) images.

 Here is an example use case.  Amazon recently put the entire Landsat8
 archive in S3:

 http://aws.amazon.com/public-data-sets/landsat/

 I have a bunch of GDAL based (a C library for geospatial raster I/O)
 Python scripts that take a collection of Landsat images and mash them into
 a single mosaic.  This works great for little mosaics, but if I wanted to
 do the entire world, I need more horse power!  The scripts do the following:

1. Copy the selected rasters down from S3 to the local file system
2. Read each image into memory as numpy arrays (a big 3D array), do
some image processing using various Python libs, and write the result out
to the local file system
3. Blast all the processed imagery back to S3, and hooks up MapServer
for viewing

 Step 2 takes a long time; this is what I'd like to leverage Spark for.
 Each image, if you stack all the bands together, can be ~1 GB in size.

 So here are a couple of questions:


1. If I have a large image/array, what's a good way of getting it into
an RDD?  I've seen some stuff about folks tiling up imagery into little
chunks and storing it in HBase.  I imagine I would want an image chunk in
each partition of the RDD.  If I wanted to apply something like a gaussian
filter I'd need each chunk to to overlap a bit.
2. In a similar vain, does anyone have any thoughts on storing a
really large raster in HDFS?  Seems like if I just dump the image into HDFS
as it, it'll get stored in blocks all across the system and when I go to
read it, there will be a ton of network traffic from all the blocks to the
reading node!
3. How is the numpy's ndarray support in Spark?  For instance, if I do
a map on my theoretical chunked image RDD, can I easily realize the image
chunk as a numpy array inside the function?  Most of the Python algorithms
I use take in and return a numpy array.

 I saw some discussion in the past on image processing:

 These threads talk about processing lots of little images, but this isn't
 really my situation as I've got one very large image:


 http://apache-spark-user-list.1001560.n3.nabble.com/Better-way-to-process-large-image-data-set-td14533.html

 http://apache-spark-user-list.1001560.n3.nabble.com/Processing-audio-video-images-td6752.html

 Further, I'd like to have the imagery in HDFS rather than on the file
 system to avoid I/O bottlenecks if possible!

 Thanks for any ideas and advice!
 -Patrick





Re: Spark 1.2.0 with Play/Activator

2015-04-06 Thread andy petrella
Hello Manish,

you can take a look at the spark-notebook build, it's a bit tricky to get
rid of some clashes but at least you can refer to this build to have ideas.
LSS, I have stripped out akka from play deps.

ref:
https://github.com/andypetrella/spark-notebook/blob/master/build.sbt
https://github.com/andypetrella/spark-notebook/blob/master/project/Dependencies.scala
https://github.com/andypetrella/spark-notebook/blob/master/project/Shared.scala

HTH, cheers
andy

Le mar 7 avr. 2015 07:26, Manish Gupta 8 mgupt...@sapient.com a écrit :

  Hi,



 We are trying to build a Play framework based web application integrated
 with Apache Spark. We are running *Apache Spark 1.2.0 CDH 5.3.0*. But
 struggling with akka version conflicts (errors like
 java.lang.NoSuchMethodError in akka). We have tried Play 2.2.6 as well as
 Activator 1.3.2.



 If anyone has successfully integrated Spark 1.2.0 with Play/Activator,
 please share the version we should use and akka dependencies we should add
 in Build.sbt.



 Thanks,

 Manish



Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-26 Thread andy petrella
That's purely awesome! Don't hesitate to contribute your notebook back to the
spark notebook repo, even rough, I'll help cleaning up if needed.

The vagrant is also appealing 

Congrats!

On Thu, 26 Mar 2015 at 22:22, David Holiday dav...@annaisystems.com wrote:

  w0t! that did it! t/y so much!

  I'm going to put together a pastebin or something that has all the code
 put together so if anyone else runs into this issue they will have some
 working code to help them figure out what's going on.

 DAVID HOLIDAY
  Software Engineer
  760 607 3300 | Office
  312 758 8385 | Mobile
  dav...@annaisystems.com broo...@annaisystems.com



 www.AnnaiSystems.com

  On Mar 26, 2015, at 12:24 PM, Corey Nolet cjno...@gmail.com wrote:

  Spark uses a SerializableWritable [1] to java serialize writable
 objects. I've noticed (at least in Spark 1.2.1) that it breaks down with
 some objects when Kryo is used instead of regular java serialization.
 Though it is  wrapping the actual AccumuloInputFormat (another example of
 something you may want to do in the future), we have Accumulo working to
 load data from a table into Spark SQL [2]. The way Spark uses the
 InputFormat is very straightforward.

  [1]
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SerializableWritable.scala
 [2]
 https://github.com/calrissian/accumulo-recipes/blob/master/thirdparty/spark/src/main/scala/org/calrissian/accumulorecipes/spark/sql/EventStoreCatalyst.scala#L76

 On Thu, Mar 26, 2015 at 3:06 PM, Nick Pentreath nick.pentre...@gmail.com
 wrote:

 I'm guessing the Accumulo Key and Value classes are not serializable, so
 you would need to do something like

  val rdd = sc.newAPIHadoopRDD(...).map { case (key, value) =>
 (extractScalaType(key), extractScalaType(value)) }

  Where 'extractScalaType' converts the Key or Value to a standard Scala
 type or case class or whatever - basically extracts the data from the Key
 or Value in a form usable in Scala.
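
 Something along these lines, for instance (a sketch: `toScalaRecord` is an
 illustrative name, and which fields you pull out depends on your table
 layout):

```
import org.apache.accumulo.core.data.{Key, Value}

// Turn the non-serializable Accumulo types into plain Scala values right
// after the newAPIHadoopRDD call, before any shuffle or collect.
def toScalaRecord(key: Key, value: Value): (String, String, String, Array[Byte]) =
  (key.getRow.toString,
   key.getColumnFamily.toString,
   key.getColumnQualifier.toString,
   value.get())

val usable = rddX.map { case (k, v) => toScalaRecord(k, v) }
usable.first()  // no more "not serializable result" on Key
```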

 —
 Sent from Mailbox https://www.dropbox.com/mailbox


   On Thu, Mar 26, 2015 at 8:59 PM, Russ Weeks rwe...@newbrightidea.com
 wrote:

   Hi, David,

  This is the code that I use to create a JavaPairRDD from an Accumulo
 table:

  JavaSparkContext sc = new JavaSparkContext(conf);
 Job job = Job.getInstance(conf, "TestSparkJob");
 job.setInputFormatClass(AccumuloInputFormat.class);
 AccumuloInputFormat.setZooKeeperInstance(job,
 conf.get(ZOOKEEPER_INSTANCE_NAME),
 conf.get(ZOOKEEPER_HOSTS)
 );
 AccumuloInputFormat.setConnectorInfo(job,
 conf.get(ACCUMULO_AGILE_USERNAME),
 new PasswordToken(conf.get(ACCUMULO_AGILE_PASSWORD))
 );
 AccumuloInputFormat.setInputTableName(job,
 conf.get(ACCUMULO_TABLE_NAME));
 AccumuloInputFormat.setScanAuthorizations(job, auths);
 JavaPairRDD<Key, Value> values =
 sc.newAPIHadoopRDD(job.getConfiguration(), AccumuloInputFormat.class,
 Key.class, Value.class);

  Key.class and Value.class are from org.apache.accumulo.core.data. I
 use a WholeRowIterator so that the Value is actually an encoded
 representation of an entire logical row; it's a useful convenience if you
 can be sure that your rows always fit in memory.

  I haven't tested it since Spark 1.0.1 but I doubt anything important
 has changed.

  Regards,
 -Russ


  On Thu, Mar 26, 2015 at 11:41 AM, David Holiday 
 dav...@annaisystems.com wrote:

   *progress!*

 I was able to figure out why the 'input INFO not set' error was
 occurring. The eagle-eyed among you will no doubt see the following code is
 missing a closing ')'

 AbstractInputFormat.setConnectorInfo(jobConf, root, new 
 PasswordToken(password)

 as I'm doing this in spark-notebook, I'd been clicking the execute
 button and moving on because I wasn't seeing an error. what I forgot was
 that notebook is going to do what spark-shell will do when you leave off a
 closing ')' -- *it will wait forever for you to add it*. so the error
 was the result of the 'setConnectorInfo' method never getting executed.

 unfortunately, I'm still unable to shove the accumulo table data into
 an RDD that's useable to me. when I execute

 rddX.count

 I get back

 res15: Long = 10000

 which is the correct response - there are 10,000 rows of data in the
 table I pointed to. however, when I try to grab the first element of data
 thusly:

 rddX.first

 I get the following error:

 org.apache.spark.SparkException: Job aborted due to stage failure: Task
 0.0 in stage 0.0 (TID 0) had a not serializable result:
 org.apache.accumulo.core.data.Key

 any thoughts on where to go from here?
 DAVID HOLIDAY
  Software Engineer
  760 607 3300 | Office
  312 758 8385 | Mobile
  dav...@annaisystems.com broo...@annaisystems.com



 www.AnnaiSystems.com http://www.annaisystems.com/

On Mar 26, 2015, at 8:35 AM, David Holiday dav...@annaisystems.com
 wrote:

  hi Nick

  Unfortunately the Accumulo docs are woefully inadequate, and in some
 places, flat wrong. I'm not sure if this is a case 

Re: Supported Notebooks (and other viz tools) for Spark 0.9.1?

2015-02-03 Thread andy petrella
Hello Adamantios,

Thanks for the poke and the interest.
Actually, you're the second person asking about backporting it. Yesterday (late),
I created a branch for it... and the simple local Spark test worked! \o/.
However, it'll be the 'old' UI :-/, since I didn't port the code using
Play 2.2.6 to the new UI.
FYI: play 2.2.6 uses a compliant akka version, that's why I mention it.

It was too late for a push :-D, so I'll commit and push this evening.
At least, you can try it tomorrow.

I shall be on gitter this evening as well if there are questions:
https://gitter.im/andypetrella/spark-notebook

Cheers,
andy

On Tue Feb 03 2015 at 2:05:35 PM Adamantios Corais 
adamantios.cor...@gmail.com wrote:

 Hi,

 I am using Spark 0.9.1 and I am looking for a proper viz tools that
 supports that specific version. As far as I have seen all relevant tools
 (e.g. spark-notebook, zeppelin-project etc) only support 1.1 or 1.2; no
 mentions about older versions of Spark. Any ideas or suggestions?


 *// Adamantios*





Re: Supported Notebooks (and other viz tools) for Spark 0.9.1?

2015-02-03 Thread andy petrella
Adamantios,

As said, I backported it to 0.9.x and now it's pushed on this branch:
https://github.com/andypetrella/spark-notebook/tree/spark-0.9.x.

I didn't create a dist atm, because I'd prefer to do it only if necessary
:-).
So, if you want to try it out, just clone the repo, check out this
branch and launch `sbt run`.

HTH,
andy

On Tue Feb 03 2015 at 2:45:43 PM andy petrella andy.petre...@gmail.com
wrote:

 Hello Adamantios,

 Thanks for the poke and the interest.
 Actually, you're the second person asking about backporting it. Yesterday (late),
 I created a branch for it... and the simple local Spark test worked! \o/.
 However, it'll be the 'old' UI :-/, since I didn't port the code using
 Play 2.2.6 to the new UI.
 FYI: play 2.2.6 uses a compliant akka version, that's why I mention it.

 It was too late for a push :-D, so I'll commit and push this evening.
 At least, you can try it tomorrow.

 I shall be on gitter this evening as well if there are questions:
 https://gitter.im/andypetrella/spark-notebook

 Cheers,
 andy

 On Tue Feb 03 2015 at 2:05:35 PM Adamantios Corais 
 adamantios.cor...@gmail.com wrote:

 Hi,

 I am using Spark 0.9.1 and I am looking for a proper viz tools that
 supports that specific version. As far as I have seen all relevant tools
 (e.g. spark-notebook, zeppelin-project etc) only support 1.1 or 1.2; no
 mentions about older versions of Spark. Any ideas or suggestions?


 *// Adamantios*





Re: Using TF-IDF from MLlib

2014-12-29 Thread andy petrella
Here is what I did for this case : https://github.com/andypetrella/tf-idf
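
For reference, a rough sketch of the (label, terms) to LabeledPoint pipeline Sean describes below, assuming the input is an RDD[(Double, Seq[String])]; the final zip relies on both sides being narrow maps of the same cached parent:

import org.apache.spark.SparkContext._   // pair RDD ops on older Spark versions
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def toLabeledPoints(data: RDD[(Double, Seq[String])]): RDD[LabeledPoint] = {
  val hashingTF = new HashingTF()
  // One TF vector per document, keeping the label alongside it.
  val labeledTf: RDD[(Double, Vector)] =
    data.mapValues(terms => hashingTF.transform(terms)).cache()
  // Fit the IDF model on the TF vectors only, then re-weight them.
  val idfModel = new IDF().fit(labeledTf.values)
  val tfidf = idfModel.transform(labeledTf.values)
  // keys, values and transform are all narrow maps of the same cached RDD,
  // so zip pairs labels and vectors back up element by element.
  labeledTf.keys.zip(tfidf).map { case (label, vector) => LabeledPoint(label, vector) }
}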

On Mon, Dec 29, 2014 at 11:31, Sean Owen so...@cloudera.com wrote:

 Given (label, terms) you can just transform the values to a TF vector,
 then TF-IDF vector, with HashingTF and IDF / IDFModel. Then you can
 make a LabeledPoint from (label, vector) pairs. Is that what you're
 looking for?

 On Mon, Dec 29, 2014 at 3:37 AM, Yao y...@ford.com wrote:
  I found the TF-IDF feature extraction and all the MLlib code that work
 with
  pure Vector RDD very difficult to work with due to the lack of ability to
  associate vector back to the original data. Why can't Spark MLlib support
  LabeledPoint?
 
 
 
  --
  View this message in context: http://apache-spark-user-list.
 1001560.n3.nabble.com/Using-TF-IDF-from-MLlib-tp19429p20876.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Discourse: A proposed alternative to the Spark User list

2014-12-25 Thread andy petrella
Nice idea, although it needs a plan for their hosting, or for Spark to host it,
if I'm not wrong.

I've been using Slack for discussions, it's not exactly the same of
discourse, the ML or SO but offers interesting features.
It's more in the mood of IRC integrated with external services.

my2c

On Wed Dec 24 2014 at 21:50:48 Nick Chammas nicholas.cham...@gmail.com
wrote:

 When people have questions about Spark, there are 2 main places (as far as
 I can tell) where they ask them:

- Stack Overflow, under the apache-spark tag
http://stackoverflow.com/questions/tagged/apache-spark
- This mailing list

 The mailing list is valuable as an independent place for discussion that
 is part of the Spark project itself. Furthermore, it allows for a broader
 range of discussions than would be allowed on Stack Overflow
 http://stackoverflow.com/help/dont-ask.

 As the Spark project has grown in popularity, I see that a few problems
 have emerged with this mailing list:

- It’s hard to follow topics (e.g. Streaming vs. SQL) that you’re
interested in, and it’s hard to know when someone has mentioned you
specifically.
- It’s hard to search for existing threads and link information across
disparate threads.
- It’s hard to format code and log snippets nicely, and by extension,
hard to read other people’s posts with this kind of information.

 There are existing solutions to all these (and other) problems based
 around straight-up discipline or client-side tooling, which users have to
 conjure up for themselves.

 I’d like us as a community to consider using Discourse
 http://www.discourse.org/ as an alternative to, or overlay on top of,
 this mailing list, that provides better out-of-the-box solutions to these
 problems.

 Discourse is a modern discussion platform built by some of the same people
 who created Stack Overflow. It has many neat features
 http://v1.discourse.org/about/ that I believe this community would
 benefit from.

 For example:

- When a user starts typing up a new post, they get a panel *showing
existing conversations that look similar*, just like on Stack Overflow.
- It’s easy to search for posts and link between them.
- *Markdown support* is built-in to composer.
- You can *specifically mention people* and they will be notified.
- Posts can be categorized (e.g. Streaming, SQL, etc.).
- There is a built-in option for mailing list support which forwards
all activity on the forum to a user’s email address and which allows for
creation of new posts via email.

 What do you think of Discourse as an alternative, more manageable way to
 discus Spark?

 There are a few options we can consider:

1. Work with the ASF as well as the Discourse team to allow Discourse
to act as an overlay on top of this mailing list

 https://meta.discourse.org/t/discourse-as-a-front-end-for-existing-asf-mailing-lists/23167?u=nicholaschammas,
allowing people to continue to use the mailing list as-is if they want.
(This is the toughest but perhaps most attractive option.)
2. Create a new Discourse forum for Spark that is not bound to this
user list. This is relatively easy but will effectively fork the community
on this list. (We cannot shut down this mailing in favor of one managed by
Discourse.)
3. Don’t use Discourse. Just encourage people on this list to post
instead on Stack Overflow whenever possible.
4. Something else.

 What does everyone think?

 Nick
 ​

 --
 View this message in context: Discourse: A proposed alternative to the
 Spark User list
 http://apache-spark-user-list.1001560.n3.nabble.com/Discourse-A-proposed-alternative-to-the-Spark-User-list-tp20851.html
 Sent from the Apache Spark User List mailing list archive
 http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.



Re: spark-repl_1.2.0 was not uploaded to central maven repository.

2014-12-21 Thread andy petrella
Actually yes, things like interactive notebooks f.i.

On Sun Dec 21 2014 at 11:35:18 AM Sean Owen so...@cloudera.com wrote:

 I'm only speculating, but I wonder if it was on purpose? would people
 ever build an app against the REPL?

 On Sun, Dec 21, 2014 at 5:50 AM, Peng Cheng pc...@uow.edu.au wrote:
  Everything else is there except spark-repl. Can someone check that out
 this
  weekend?
 
 
 
  --
  View this message in context: http://apache-spark-user-list.
 1001560.n3.nabble.com/spark-repl-1-2-0-was-not-uploaded-
 to-central-maven-repository-tp20799.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Implementing a spark version of Haskell's partition

2014-12-18 Thread andy petrella
NP man,

The thing is that since you're in a distributed environment, it'd be cumbersome
to do that. Remember that Spark works basically on blocks/partitions; they are the
unit of distribution and parallelization.
That means that actions have to be run against them **after having been
scheduled on the cluster**.
The latter point is the most important: the RDDs aren't
really created on the driver; the collection is created/transformed/... on
the partitions.
As a consequence, you cannot, on the driver, create such a representation
of the distributed collection since you haven't seen it yet.
That being said, you can only prepare/define some computations on the
driver that will segregate the data by applying a filter on the nodes.
If you want to keep RDD operators as they are, yes, you'll need to pass over
the distributed data twice (see the sketch below).
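
A minimal sketch of that two-pass approach, caching the parent so the data isn't recomputed for each filter:

import org.apache.spark.rdd.RDD

// "Partition" an RDD in two by filtering twice against a cached parent.
def split[A](rdd: RDD[A])(p: A => Boolean): (RDD[A], RDD[A]) = {
  val cached = rdd.cache()
  (cached.filter(p), cached.filter(a => !p(a)))
}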

The option of using mapPartitions, for instance, would be to create an
RDD[(Seq[A], Seq[A])]; however it's going to be tricky because you might
have to repartition, otherwise the OOMs might blow up in your face :-D.
I won't pick that one!


A final note: looping over the data twice is not that much of a problem (especially
if you can cache it), and in fact it's way better to keep the advantages of resilience
etc. that come with Spark.

my2c
andy


On Wed Dec 17 2014 at 7:07:05 PM Juan Rodríguez Hortalá 
juan.rodriguez.hort...@gmail.com wrote:

 Hi Andy,  thanks for your response. I already thought about filtering
 twice, that was what I meant with that would be equivalent to applying
 filter twice, but I was thinking if I could do it in a single pass, so
 that could be later generalized to an arbitrary numbers of classes. I would
 also like to be able to generate RDDs instead of partitions of a single
 RDD, so I could use RDD methods like stats() on the fragments. But I think
 there is currently no RDD method that returns more than one RDD for a
 single input RDD, so maybe there is some design limitation on Spark that
 prevents this?

 Again, thanks for your answer.

 Greetings,

 Juan
 On 17/12/2014 18:15, andy petrella andy.petre...@gmail.com wrote:

 yo,

 First, here is the scala version:
 http://www.scala-lang.org/api/current/index.html#scala.collection.Seq@partition(p:A=
 Boolean):(Repr,Repr)

  Second: RDD is distributed, so what you'll have to do is to partition each
 partition (:-D) or create two RDDs by filtering twice →
 hence tasks will be scheduled distinctly, and data read twice. Choose
 what's best for you!

 hth,
 andy


 On Wed Dec 17 2014 at 5:57:56 PM Juan Rodríguez Hortalá 
 juan.rodriguez.hort...@gmail.com wrote:

 Hi all,

 I would like to be able to split a RDD in two pieces according to a
 predicate. That would be equivalent to applying filter twice, with the
 predicate and its complement, which is also similar to Haskell's partition
 list function (
 http://hackage.haskell.org/package/base-4.7.0.1/docs/Data-List.html).
 There is currently any way to do this in Spark?, or maybe anyone has a
 suggestion about how to implent this by modifying the Spark source. I think
 this is valuable because sometimes I need to split a RDD in several groups
 that are too big to fit in the memory of a single thread, so pair RDDs are
 not solution for those cases. A generalization to n parts of Haskell's
 partition would do the job.

 Thanks a lot for your help.

 Greetings,

 Juan Rodriguez




Re: Implementing a spark version of Haskell's partition

2014-12-17 Thread andy petrella
yo,

First, here is the scala version:
http://www.scala-lang.org/api/current/index.html#scala.collection.Seq@partition(p:A=
Boolean):(Repr,Repr)

Second: RDD is distributed, so what you'll have to do is to partition each
partition (:-D) or create two RDDs by filtering twice →
hence tasks will be scheduled distinctly, and data read twice. Choose
what's best for you!

hth,
andy


On Wed Dec 17 2014 at 5:57:56 PM Juan Rodríguez Hortalá 
juan.rodriguez.hort...@gmail.com wrote:

 Hi all,

 I would like to be able to split a RDD in two pieces according to a
 predicate. That would be equivalent to applying filter twice, with the
 predicate and its complement, which is also similar to Haskell's partition
 list function (
 http://hackage.haskell.org/package/base-4.7.0.1/docs/Data-List.html).
 There is currently any way to do this in Spark?, or maybe anyone has a
 suggestion about how to implent this by modifying the Spark source. I think
 this is valuable because sometimes I need to split a RDD in several groups
 that are too big to fit in the memory of a single thread, so pair RDDs are
 not solution for those cases. A generalization to n parts of Haskell's
 partition would do the job.

 Thanks a lot for your help.

 Greetings,

 Juan Rodriguez



Re: Is Spark the right tool for me?

2014-12-02 Thread andy petrella
Point 4 looks weird to me; I mean, if you intend to have such a workflow
run in a single session (maybe consider a sessionless arch).
Is such a process for each user? If that's the case, maybe finding a way to do
it for all at once would be better (more data but less scheduling).

For the micro updates, considering something like a queue (Kestrel? or even
Kafk... whatever, something that works) would be great. So you remove the
load off the instances, and the updates can be done at their own pace. Also,
you can reuse it to notify the WMS.
Isn't there a way to do the tiling directly? Also, do you need indexes? I mean,
do you need the full OGIS power, or are just some classical operators
enough (using BBox only, for instance)?

The more you can simplify the better :-D.

These are only my2c, it's hard to think or react appropriately without
knowing the whole context.
BTW, to answer your very first question: yes, it looks like Spark will help
you!

cheers,
andy



On Mon Dec 01 2014 at 4:36:44 PM Stadin, Benjamin 
benjamin.sta...@heidelberg-mobil.com wrote:

 Yes, the processing causes the most stress. But this is parallelizable by
 splitting the input source. My problem is that once the heavy preprocessing
 is done, I’m in a „micro-update“ mode so to say (user-interactive part of
 the whole workflow). Then the map is rendered directly from the SQLite file
 by the map server instance on that machine – this is actually a favorable
 setup for me for resource consumption and implementation costs (I just need
 to tell the web ui to refresh after something was written to the db, and
 the map server will render the updates without me changing / coding
 anything). So my workflow requires to break out of parallel processing for
 some time.

 Do you think my generalized workflow and tool chain can be like so?

1. Pre-Process many files in a parallel way. Gather all results,
deploy them on one single machine. = Spark coalesce() + Crunch (for
splitting input files into separate tasks)
2. On the machine where preprocessed results are on, configure a map
server to connect to the local SQLite source. Do user-interactive
micro-updates on that file (web UI gets updated).
3. Post-process the files in parallel. = Spark + Crunch
4. Design all of the above as a workflow, runnable (or assignable) as
part of a user session. = Oozie

 Do you think this is ok?

 ~Ben


 From: andy petrella andy.petre...@gmail.com
 Date: Monday, December 1, 2014 15:48

 To: Benjamin Stadin benjamin.sta...@heidelberg-mobil.com, 
 user@spark.apache.org user@spark.apache.org
 Subject: Re: Is Spark the right tool for me?

 Indeed. However, I guess the important load and stress is in the
 processing of the 3D data (DEM or alike) into geometries/shades/whatever.
 Hence you can use spark (geotrellis can be tricky for 3D, poke @lossyrob
 for more info) to perform these operations then keep an RDD of only the
 resulting geometries.
 Those geometries probably won't be that heavy, hence it might be possible to
 coalesce(1, true) to have the whole thing on one node (or, if your driver is
 more beefy, do a collect/foreach) to create the index.
 You could also create a GeoJSON of the geometries and create the r-tree on
 it (not sure about this one).



 On Mon Dec 01 2014 at 3:38:00 PM Stadin, Benjamin 
 benjamin.sta...@heidelberg-mobil.com wrote:

 Thank you for mentioning GeoTrellis. I haven’t heard of this before. We
 have many custom tools and steps, I’ll check our tools fit in. The end
 result after is actually a 3D map for native OpenGL based rendering on iOS
 / Android [1].

 I’m using GeoPackage which is basically SQLite with R-Tree and a small
 library around it (more lightweight than SpatialLite). I want to avoid
 accessing the SQLite db from any other machine or task, that’s where I
 thought I can use a long running task which is the only process responsible
 to update a local-only stored SQLite db file. As you also said SQLite  (or
 mostly any other file based db) won’t work well over network. This isn’t
 only limited to R-Tree but expected limitation because of file locking
 issues as documented also by SQLite.

 I also thought to do the same thing when rendering the (web) maps. In
 combination with the db handler which does the actual changes, I thought to
 run a map server instance on each node, configure it to add the database
 location as map source once the task starts.

 Cheers
 Ben

 [1] http://www.deep-map.com

  From: andy petrella andy.petre...@gmail.com
  Date: Monday, December 1, 2014 15:07
  To: Benjamin Stadin benjamin.sta...@heidelberg-mobil.com, 
  user@spark.apache.org user@spark.apache.org
  Subject: Re: Is Spark the right tool for me?

 Not quite sure which geo processing you're doing are they raster, vector? 
 More
 info will be appreciated for me to help you further.

 Meanwhile I can try to give some hints, for instance, did you considered
 GeoMesa http://www.geomesa.org/2014/08/05/spark/?
 Since you need

Re: Is Spark the right tool for me?

2014-12-02 Thread andy petrella
You might also have to check the Spark JobServer, it could help you at some
point.

On Tue Dec 02 2014 at 12:29:01 PM Stadin, Benjamin 
benjamin.sta...@heidelberg-mobil.com wrote:

 To be precise I want the workflow to be associated to a user, but it
 doesn’t need to be run as part of or depend on a session. I can’t run
 scheduled jobs, because a user can potentially upload hundreds of files
 which trigger a long running batch import / update process but he could
 also make a very small upload / update and immediately wants to continue to
 work on the (temporary) data that he just uploaded. So that same workflow
 duration may vary between some seconds, a minute and hours, completely
 depending on the project's size.

 So a user can log off and on again to the web site and the initial upload
 + conversion step may either be still running or finished. He’ll see the
 progress on the web site, and once the initial processing is done he can
 continue with the next step of the import workflow, he can interactively
 change some stuff on that temporary data. After he is done changing stuff,
 he can hit a „continue“ button which triggers again a long or short running
 post-processing pipe. Then the user can make a final review of that now
 post-processed data, and after hitting a „save“ button a final commits pipe
 pushes / merges the until now temporary data to some persistent store.

 You’re completely right that I should simplify as much as possible.
 Finding the right mix seems key. I’ve also considered to use Kafka to
 message between Web UI and the pipes, I think it will fit. Chaining the
 pipes together as a workflow and implementing, managing and monitoring
 these long running user tasks with locality  as I need them is still
 causing me headache.

 Btw, the tiling and indexing is not a problem. My problem is mainly in
 parallelized conversion, polygon creation, cleaning of CAD file data (e.g.
 GRASS, prepair, custom tools). After all parts have been preprocessed and
 gathered in one place, the initial creation of the preview geo file is
 taking a fraction of the time (inserting all data in one transaction,
 taking somewhere between sub-second and  10 seconds for very large
 projects). It’s currently not a concern.

 (searching for a Kafka+Spark example now)

 Cheers
 Ben


  From: andy petrella andy.petre...@gmail.com
  Date: Tuesday, December 2, 2014 10:00

  To: Benjamin Stadin benjamin.sta...@heidelberg-mobil.com, 
  user@spark.apache.org user@spark.apache.org
  Subject: Re: Is Spark the right tool for me?

 The point 4 looks weird to me, I mean if you intent to have such workflow
 to run in a single session (maybe consider sessionless arch)
 Is such process for each user? If it's the case, maybe finding a way to
 do it for all at once would be better (more data but less scheduling).

 For the micro updates, considering something like a queue (kestrel? or
 even kafk... whatever, something that works) would be great. So you remove
 the load off the instances, and the updates can be done at its own pace.
 Also, you can reuse it to notify the WMS.
 Isn't there a way to do tiling directly? Also, do you need indexes, I mean
 do you need the full OGIS power, or just some classical operators are
 enough (using BBox only for instance)?

 The more you can simplify the better :-D.

 These are only my2c, it's hard to think or react appropriately without
 knowing the whole context.
 BTW, to answer your very first question: yes, it looks like Spark will
 help you!

 cheers,
 andy



 On Mon Dec 01 2014 at 4:36:44 PM Stadin, Benjamin 
 benjamin.sta...@heidelberg-mobil.com wrote:

  Yes, the processing causes the most stress. But this is parallelizable by
 splitting the input source. My problem is that once the heavy preprocessing
 is done, I’m in a „micro-update“ mode so to say (user-interactive part of
 the whole workflow). Then the map is rendered directly from the SQLite file
 by the map server instance on that machine – this is actually a favorable
 setup for me for resource consumption and implementation costs (I just need
 to tell the web ui to refresh after something was written to the db, and
 the map server will render the updates without me changing / coding
 anything). So my workflow requires to break out of parallel processing for
 some time.

 Do you think my generalized workflow and tool chain can be like so?

1. Pre-Process many files in a parallel way. Gather all results,
deploy them on one single machine. = Spark coalesce() + Crunch (for
splitting input files into separate tasks)
2. On the machine where preprocessed results are on, configure a map
server to connect to the local SQLite source. Do user-interactive
micro-updates on that file (web UI gets updated).
3. Post-process the files in parallel. = Spark + Crunch
4. Design all of the above as a workflow, runnable (or assignable) as
part of a user session. = Oozie

 Do you think this is ok?

 ~Ben


 Von

Re: Is Spark the right tool for me?

2014-12-01 Thread andy petrella
Not quite sure which geo processing you're doing: is it raster or vector? More
info would be appreciated so I can help you further.

Meanwhile I can try to give some hints; for instance, did you consider
GeoMesa http://www.geomesa.org/2014/08/05/spark/?
Since you need a WMS (or alike), did you consider GeoTrellis
http://geotrellis.io/ (go to the batch processing)?

When you say SQLite, do you mean that you're using SpatiaLite? Or is your db
not a geo one, just plain SQLite? In case you need an r-tree (or
related) index, your headaches will come from congestion within your
database transactions... unless you go to a dedicated database like Vertica
(just mentioning it)

kr,
andy



On Mon Dec 01 2014 at 2:49:44 PM Stadin, Benjamin 
benjamin.sta...@heidelberg-mobil.com wrote:

 Hi all,

 I need some advice on whether Spark is the right tool for my zoo. My
 requirements share commonalities with „big data“, workflow coordination and
 „reactive“ event driven data processing (as in for example Haskell Arrows),
 which doesn’t make it any easier to decide on a tool set.

 NB: I have asked a similar question on the Storm mailing list, but have
 been deferred to Spark. I previously thought Storm was closer to my needs –
 but maybe neither is.

 To explain my needs it’s probably best to give an example scenario:

- A user uploads small files (typically 1-200 files, file size
typically 2-10MB per file)
- Files should be converted in parallel and on available nodes. The
conversion is actually done via native tools, so there is not so much big
data processing required, but dynamic parallelization (so for example to
split the conversion step into as many conversion tasks as files are
available). The conversion typically takes between several minutes and a
few hours.
- The converted files are gathered and stored in a single database
(containing geometries for rendering)
- Once the db is ready, a web map server is (re-)configured and the
user can make small updates to the data set via a web UI.
- … Some other data processing steps which I leave away for brevity …
- There will be initially only a few concurrent users, but the system
shall be able to scale if needed

 My current thoughts:

- I should avoid to upload files into the distributed storage during
conversion, but probably should rather have each conversion filter download
the file it is actually converting from a shared place. Otherwise it’s bad
for scalability reasons (too many redundant copies of same temporary files
if there are many concurrent users and many cluster nodes).
- Apache Oozie seems an option to chain together my pipes into a
workflow. But is it a good fit with Spark? What options do I have with
Spark to chain a workflow from pipes?
- Apache Crunch seems to make it easy to dynamically parallelize tasks
(Oozie itself can’t do this). But I may not need crunch after all if I have
Spark, and it also doesn’t seem to fit to my last problem following.
- The part that causes me the most headache is the user interactive db
update: I consider to use Kafka as message bus to broker between the web UI
and a custom db handler (nb, the db is a SQLite file). But how about
update responsiveness, isn’t it that Spark will cause some lags (as opposed
to Storm)?
- The db handler probably has to be implemented as a long running
continuing task, so when a user sends some changes the handler writes these
to the db file. However, I want this to be decoupled from the job. So file
these updates should be done locally only on the machine that started the
job for the whole lifetime of this user interaction. Does Spark allow to
create such long running tasks dynamically, so that when another (web) user
starts a new task a new long–running task is created and run on the same
node, which eventually ends and triggers the next task? Also, is it
possible to identify a running task, so that a long running task can be
bound to a session (db handler working on local db updates, until task
done), and eventually restarted / recreated on failure?


 ~Ben



Re: Is Spark the right tool for me?

2014-12-01 Thread andy petrella
Indeed. However, I guess the important load and stress is in the processing
of the 3D data (DEM or alike) into geometries/shades/whatever.
Hence you can use spark (geotrellis can be tricky for 3D, poke @lossyrob
for more info) to perform these operations then keep an RDD of only the
resulting geometries.
Those geometries probably won't be that heavy, hence it might be possible to
coalesce(1, true) to have the whole thing on one node (or, if your driver is
more beefy, do a collect/foreach) to create the index.
You could also create a GeoJSON of the geometries and create the r-tree on
it (not sure about this one).



On Mon Dec 01 2014 at 3:38:00 PM Stadin, Benjamin 
benjamin.sta...@heidelberg-mobil.com wrote:

 Thank you for mentioning GeoTrellis. I haven’t heard of this before. We
 have many custom tools and steps, I’ll check our tools fit in. The end
 result after is actually a 3D map for native OpenGL based rendering on iOS
 / Android [1].

 I’m using GeoPackage which is basically SQLite with R-Tree and a small
 library around it (more lightweight than SpatialLite). I want to avoid
 accessing the SQLite db from any other machine or task, that’s where I
 thought I can use a long running task which is the only process responsible
 to update a local-only stored SQLite db file. As you also said SQLite  (or
 mostly any other file based db) won’t work well over network. This isn’t
 only limited to R-Tree but expected limitation because of file locking
 issues as documented also by SQLite.

 I also thought to do the same thing when rendering the (web) maps. In
 combination with the db handler which does the actual changes, I thought to
 run a map server instance on each node, configure it to add the database
 location as map source once the task starts.

 Cheers
 Ben

 [1] http://www.deep-map.com

 From: andy petrella andy.petre...@gmail.com
 Date: Monday, December 1, 2014 15:07
 To: Benjamin Stadin benjamin.sta...@heidelberg-mobil.com, 
 user@spark.apache.org user@spark.apache.org
 Subject: Re: Is Spark the right tool for me?

 Not quite sure which geo processing you're doing: is it raster or vector? More
 info would be appreciated so I can help you further.

 Meanwhile I can try to give some hints; for instance, did you consider
 GeoMesa http://www.geomesa.org/2014/08/05/spark/?
 Since you need a WMS (or alike), did you consider GeoTrellis
 http://geotrellis.io/ (go to the batch processing)?

 When you say SQLite, do you mean that you're using SpatiaLite? Or is your db
 not a geo one, just plain SQLite? In case you need an r-tree (or
 related) index, your headaches will come from congestion within your
 database transactions... unless you go to a dedicated database like Vertica
 (just mentioning it)

 kr,
 andy



 On Mon Dec 01 2014 at 2:49:44 PM Stadin, Benjamin 
 benjamin.sta...@heidelberg-mobil.com wrote:

 Hi all,

 I need some advice on whether Spark is the right tool for my zoo. My
 requirements share commonalities with „big data“, workflow coordination and
 „reactive“ event driven data processing (as in for example Haskell Arrows),
 which doesn’t make it any easier to decide on a tool set.

 NB: I have asked a similar question on the Storm mailing list, but have
 been deferred to Spark. I previously thought Storm was closer to my needs –
 but maybe neither is.

 To explain my needs it’s probably best to give an example scenario:

- A user uploads small files (typically 1-200 files, file size
typically 2-10MB per file)
- Files should be converted in parallel and on available nodes. The
conversion is actually done via native tools, so there is not so much big
data processing required, but dynamic parallelization (so for example to
split the conversion step into as many conversion tasks as files are
available). The conversion typically takes between several minutes and a
few hours.
- The converted files gathered and are stored in a single database
(containing geometries for rendering)
- Once the db is ready, a web map server is (re-)configured and the
user can make small updates to the data set via a web UI.
- … Some other data processing steps which I leave away for brevity …
- There will be initially only a few concurrent users, but the system
shall be able to scale if needed

 My current thoughts:

- I should avoid to upload files into the distributed storage during
conversion, but probably should rather have each conversion filter 
 download
the file it is actually converting from a shared place. Other wise it’s 
 bad
for scalability reasons (too many redundant copies of same temporary files
if there are many concurrent users and many cluster nodes).
- Apache Oozie seems an option to chain together my pipes into a
workflow. But is it a good fit with Spark? What options do I have with
Spark to chain a workflow from pipes?
- Apache Crunch seems to make it easy to dynamically parallelize
tasks (Oozie itself

Re: Parsing a large XML file using Spark

2014-11-21 Thread andy petrella
Actually, it's a real

On Tue Nov 18 2014 at 2:52:00 AM Tobias Pfeiffer t...@preferred.jp wrote:

 Hi,

 see https://www.mail-archive.com/dev@spark.apache.org/msg03520.html for
 one solution.

 One issue with those XML files is that they cannot be processed line by
 line in parallel; plus you inherently need shared/global state to parse XML
 or check for well-formedness, I think. (Same issue with multi-line JSON, by
 the way.)

 Tobias




Re: Parsing a large XML file using Spark

2014-11-21 Thread andy petrella
(sorry about the previous spam... Google Inbox didn't allow me to cancel
the miserable send action :-/)

So what I was about to say: it's a real PAIN in the ass to parse the
Wikipedia articles in the dump due to these multiline articles...

However, there is a way to manage that quite easily, although I found it
rather slow.

*1/ use XML reader*
Use the "org.apache.hadoop" % "hadoop-streaming" % "1.0.4" dependency

*2/ configure the hadoop job*
import org.apache.hadoop.streaming.StreamXmlRecordReader
import org.apache.hadoop.mapred.JobConf
val jobConf = new JobConf()
jobConf.set("stream.recordreader.class",
  "org.apache.hadoop.streaming.StreamXmlRecordReader")
jobConf.set("stream.recordreader.begin", "<page>")
jobConf.set("stream.recordreader.end", "</page>")
org.apache.hadoop.mapred.FileInputFormat.addInputPaths(jobConf,
  s"hdfs://$master:9000/data.xml")

// Load documents (one per line).
val documents = sparkContext.hadoopRDD(jobConf,
  classOf[org.apache.hadoop.streaming.StreamInputFormat],
  classOf[org.apache.hadoop.io.Text],
  classOf[org.apache.hadoop.io.Text])


*3/ use the result as XML doc*
import scala.xml.XML
val texts = documents.map(_._1.toString)
                     .map { s =>
                       val xml = XML.loadString(s)
                       val id = (xml \ "id").text.toDouble
                       val title = (xml \ "title").text
                       val text = (xml \ "revision" \ "text").text.replaceAll("\\W", " ")
                       val tknzed = text.split("\\W").filter(_.size > 3).toList
                       (id, title, tknzed)
                     }

HTH
andy
On Tue Nov 18 2014 at 2:52:00 AM Tobias Pfeiffer t...@preferred.jp wrote:

 Hi,

 see https://www.mail-archive.com/dev@spark.apache.org/msg03520.html for
 one solution.

 One issue with those XML files is that they cannot be processed line by
 line in parallel; plus you inherently need shared/global state to parse XML
 or check for well-formedness, I think. (Same issue with multi-line JSON, by
 the way.)

 Tobias




Re: Using TF-IDF from MLlib

2014-11-21 Thread andy petrella
Yeah, I initially used zip but I was wondering how reliable it is. I mean,
is the order guaranteed? What if some node fails, and the data is pulled
out from different nodes?
And even if it can work, I find this implicit semantic quite
uncomfortable, don't you?

My0.2c

On Fri, Nov 21, 2014 at 15:26, Daniel, Ronald (ELS-SDG) r.dan...@elsevier.com
wrote:

 Thanks for the info Andy. A big help.

 One thing - I think you can figure out which document is responsible for
 which vector without checking in more code.
 Start with a PairRDD of [doc_id, doc_string] for each document and split
 that into one RDD for each column.
 The values in the doc_string RDD get split and turned into a Seq and fed
 to TFIDF.
 You can take the resulting RDD[Vector]s and zip them with the doc_id RDD.
 Presto!

 Best regards,
 Ron






Re: Getting spark job progress programmatically

2014-11-20 Thread andy petrella
Awesome! And Patrick just gave his LGTM ;-)

On Wed Nov 19 2014 at 5:13:17 PM Aniket Bhatnagar 
aniket.bhatna...@gmail.com wrote:

 Thanks for pointing this out Mark. Had totally missed the existing JIRA
 items

 On Wed Nov 19 2014 at 21:42:19 Mark Hamstra m...@clearstorydata.com
 wrote:

 This is already being covered by SPARK-2321 and SPARK-4145.  There are
 pull requests that are already merged or already very far along -- e.g.,
 https://github.com/apache/spark/pull/3009

 If there is anything that needs to be added, please add it to those
 issues or PRs.

 On Wed, Nov 19, 2014 at 7:55 AM, Aniket Bhatnagar 
 aniket.bhatna...@gmail.com wrote:

 I have for now submitted a JIRA ticket @
 https://issues.apache.org/jira/browse/SPARK-4473. I will collate all my
 experiences ( hacks) and submit them as a feature request for public API.

 On Tue Nov 18 2014 at 20:35:00 andy petrella andy.petre...@gmail.com
 wrote:

 yep, we should also propose to add this stuff to the public API.

 Any other ideas?

 On Tue Nov 18 2014 at 4:03:35 PM Aniket Bhatnagar 
 aniket.bhatna...@gmail.com wrote:

  Thanks Andy. This is very useful. This gives me all active stages and
 their percentage completion but I am unable to tie stages to job group (or
 specific job). I looked at Spark's code and to me, it
 seems org.apache.spark.scheduler.ActiveJob's group ID should get 
 propagated
 to StageInfo (possibly in the StageInfo.fromStage method). For now, I will
 have to write my own version JobProgressListener that stores stageId to
 group Id mapping.

 I will submit a JIRA ticket and seek spark dev's opinion on this. Many
 thanks for your prompt help Andy.

 Thanks,
 Aniket


 On Tue Nov 18 2014 at 19:40:06 andy petrella andy.petre...@gmail.com
 wrote:

 I started some quick hack for that in the notebook, you can head to:
 https://github.com/andypetrella/spark-notebook/
 blob/master/common/src/main/scala/notebook/front/widgets/
 SparkInfo.scala

 On Tue Nov 18 2014 at 2:44:48 PM Aniket Bhatnagar 
 aniket.bhatna...@gmail.com wrote:

 I am writing yet another Spark job server and have been able to
 submit jobs and return/save results. I let multiple jobs use the same 
 spark
 context but I set job group while firing each job so that I can in 
 future
  cancel jobs. Further, what I desire to do is provide some kind of 
 status
  update/progress on running jobs (a % completion would be awesome) but I am
 unable to figure out appropriate spark API to use. I do however see 
 status
 reporting in spark UI so there must be a way to get status of various
 stages per job group. Any hints on what APIs should I look at?





[GraphX] Mining GeoData (OSM)

2014-11-20 Thread andy petrella
Guys,

After talking with Ankur, it turned out that sharing the talk we gave at
ScalaIO (France) would be worthwhile.
So there you go, and don't hesitate to share your thoughts ;-)

http://www.slideshare.net/noootsab/machine-learning-and-graphx

Greetz,
andy


Re: Using TF-IDF from MLlib

2014-11-20 Thread andy petrella
/Someone will correct me if I'm wrong./

Actually, TF-IDF scores terms for a given document, and specifically TF does.
Internally, these things hold a Vector (hopefully sparse)
representing all the possible words (up to 2²⁰) per document. So each
document, after applying TF, will be transformed into a Vector. `indexOf` gives
the index into that Vector.

So you can ask the frequency of all the terms in *a doc* by looping over the
doc's terms and asking for the value held in the vector at the index returned
by indexOf.

The problem you'll face in this case is that with the current
implementation it's hard to retrieve the document back. 'Cause the result
you'll have is only RDD[Vector]... so which item in your RDD is actually
the document you want?
I faced the same problem (for a demo I did at Devoxx on the Wikipedia
data), hence I've updated the TF-IDF code in a repo to allow it to hold
a reference to the original document.
https://github.com/andypetrella/TF-IDF

If you use this impl (which I need to find some time to integrate into Spark
:-/ ) you can build a pair RDD consisting of (Path, Vector) pairs, for instance.
Then this pair RDD can be searched (filter + take) for the doc you need, and
finally you can ask for the freq (or even the tf-idf score); see the sketch below.
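
A rough sketch of that lookup, assuming a pair RDD docs: RDD[(String, Vector)] of (path, TF vector) already exists along with the hashingTF used to build it:

val term = "spark"                                     // hypothetical term
val hits = docs.filter { case (path, _) => path.endsWith("some-doc.txt") }.take(1)
hits.foreach { case (path, vector) =>
  val freq = vector(hashingTF.indexOf(term))           // value held at the term's index
  println(s"'$term' has weight $freq in $path")
}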

HTH

andy





On Thu Nov 20 2014 at 1:14:24 AM Daniel, Ronald (ELS-SDG) 
r.dan...@elsevier.com wrote:

  Hi all,



 I want to try the TF-IDF functionality in MLlib.

 I can feed it words and generate the tf and idf  RDD[Vector]s, using the
 code below.

 But how do I get this back to words and their counts and tf-idf values for
 presentation?





 val sentsTmp = sqlContext.sql("SELECT text FROM sentenceTable")

 val documents: RDD[Seq[String]] = sentsTmp.map(_.toString.split(" ").toSeq)

 val hashingTF = new HashingTF()

 val tf: RDD[Vector] = hashingTF.transform(documents)

 tf.cache()

 val idf = new IDF().fit(tf)

 val tfidf: RDD[Vector] = idf.transform(tf)



 It looks like I can get the indices of the terms using something like



 J = wordListRDD.map(w => hashingTF.indexOf(w))



 where wordList is an RDD holding the distinct words from the sequence of
 words used to come up with tf.

 But how do I do the equivalent of



 Counts = J.map(j => tf.counts(j))  ?



 Thanks,

 Ron





Re: Getting spark job progress programmatically

2014-11-18 Thread andy petrella
I started some quick hack for that in the notebook, you can head to:
https://github.com/andypetrella/spark-notebook/blob/master/common/src/main/scala/notebook/front/widgets/SparkInfo.scala

On Tue Nov 18 2014 at 2:44:48 PM Aniket Bhatnagar 
aniket.bhatna...@gmail.com wrote:

 I am writing yet another Spark job server and have been able to submit
 jobs and return/save results. I let multiple jobs use the same spark
 context but I set job group while firing each job so that I can in future
 cancel jobs. Further, what I desire to do is provide some kind of status
 update/progress on running jobs (a % completion would be awesome) but I am
 unable to figure out appropriate spark API to use. I do however see status
 reporting in spark UI so there must be a way to get status of various
 stages per job group. Any hints on what APIs should I look at?
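
For what it's worth, one illustrative starting point (not tied to any particular job-server implementation) is to register a custom SparkListener and derive a rough per-stage completion ratio from the task events:

import org.apache.spark.scheduler.{SparkListener, SparkListenerStageSubmitted, SparkListenerTaskEnd}
import scala.collection.mutable

// Very rough progress tracking; real code should also handle stage completion,
// failures and concurrent access more carefully.
class ProgressListener extends SparkListener {
  private val totalTasks = mutable.Map.empty[Int, Int]
  private val doneTasks  = mutable.Map.empty[Int, Int].withDefaultValue(0)

  override def onStageSubmitted(stageSubmitted: SparkListenerStageSubmitted): Unit =
    totalTasks(stageSubmitted.stageInfo.stageId) = stageSubmitted.stageInfo.numTasks

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
    doneTasks(taskEnd.stageId) += 1

  def progress(stageId: Int): Option[Double] =
    totalTasks.get(stageId).map(total => doneTasks(stageId).toDouble / total)
}

// sc.addSparkListener(new ProgressListener())  // then poll progress(stageId) from the server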


Re: Getting spark job progress programmatically

2014-11-18 Thread andy petrella
yep, we should also propose to add this stuff to the public API.

Any other ideas?

On Tue Nov 18 2014 at 4:03:35 PM Aniket Bhatnagar 
aniket.bhatna...@gmail.com wrote:

 Thanks Andy. This is very useful. This gives me all active stages and their
 percentage completion but I am unable to tie stages to job group (or
 specific job). I looked at Spark's code and to me, it
 seems org.apache.spark.scheduler.ActiveJob's group ID should get propagated
 to StageInfo (possibly in the StageInfo.fromStage method). For now, I will
 have to write my own version JobProgressListener that stores stageId to
 group Id mapping.

 I will submit a JIRA ticket and seek spark dev's opinion on this. Many
 thanks for your prompt help Andy.

 Thanks,
 Aniket


 On Tue Nov 18 2014 at 19:40:06 andy petrella andy.petre...@gmail.com
 wrote:

 I started some quick hack for that in the notebook, you can head to:
 https://github.com/andypetrella/spark-notebook/
 blob/master/common/src/main/scala/notebook/front/widgets/SparkInfo.scala

 On Tue Nov 18 2014 at 2:44:48 PM Aniket Bhatnagar 
 aniket.bhatna...@gmail.com wrote:

 I am writing yet another Spark job server and have been able to submit
 jobs and return/save results. I let multiple jobs use the same spark
 context but I set job group while firing each job so that I can in future
 cancel jobs. Further, what I desire to do is provide some kind of status
 update/progress on running jobs (a % completion would be awesome) but I am
 unable to figure out appropriate spark API to use. I do however see status
 reporting in spark UI so there must be a way to get status of various
 stages per job group. Any hints on what APIs should I look at?




Re: Scala Spark IDE help

2014-10-28 Thread andy petrella
Also, I'm following two master students at the University of Liège (one
computing conditional probability densities on massive data and the other
implementing a Markov Chain method on georasters). I proposed that they use
the Spark Notebook to learn the framework, and they're quite happy with it (so
far at least).

I know it's not a dev option, but it can help for proofing some sequence of
steps - it's easier than the REPL to run the whole thing when we change an
intermediate step, for instance, and it's lightweight compared to sbt runs
(especially when using a testing framework).

my0.1c

cheers,


aℕdy ℙetrella
about.me/noootsab
[image: aℕdy ℙetrella on about.me]

http://about.me/noootsab

On Tue, Oct 28, 2014 at 6:01 PM, Matt Narrell matt.narr...@gmail.com
wrote:

 So, Im using Intellij 13.x, and Scala Spark jobs.

 Make sure you have singletons (objects, not classes), then simply debug
 the main function.  You’ll need to set your master to some derivation of
 “local”, but that's it.  Spark Streaming is kinda wonky when debugging, but
 data-at-rest behaves like you’d expect.
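
 A minimal sketch of that setup (names are placeholders):

 import org.apache.spark.{SparkConf, SparkContext}

 // A top-level object (not a class) whose main method is run/debugged from the IDE.
 object DebuggableJob {
   def main(args: Array[String]): Unit = {
     val conf = new SparkConf().setAppName("debuggable-job").setMaster("local[2]")
     val sc   = new SparkContext(conf)
     // Set a breakpoint anywhere below and launch this main in debug mode.
     val sum = sc.parallelize(1 to 100).map(_ * 2).sum()
     println(s"sum = $sum")
     sc.stop()
   }
 }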

 hth

 mn

 On Oct 27, 2014, at 3:03 PM, Eric Tanner eric.tan...@justenough.com
 wrote:

 I am a Scala / Spark newbie (attending Paco Nathan's class).

 What I need is some advice as to how to set up IntelliJ (or Eclipse) to be
 able to attach the debugger to the executing process.  I know that this
 is not feasible if the code is executing within the cluster.  However, if
 spark is running locally (on my laptop) I would like to attach the debugger
 process to the spark program that is running locally to be able to step
 through the program.

 Any advice will be helpful.

 Eric

 --





 *Eric Tanner* | Big Data Developer


 15440 Laguna Canyon, Suite 100

 Irvine, CA 92618


 Cell:
 Tel:
 Skype:
 Web:

   +1 (951) 313-9274
   +1 (949) 706-0400
   e http://tonya.nicholls.je/ric.tanner.je
   www.justenough.com

 Confidentiality Note: The information contained in this email and
 document(s) attached are for the exclusive use of the addressee and may
 contain confidential, privileged and non-disclosable information. If the
 recipient of this email is not the addressee, such recipient is strictly
 prohibited from reading, photocopying, distribution or otherwise using this
 email or its contents in any way.





Re: Interactive interface tool for spark

2014-10-12 Thread andy petrella
Dear Sparkers,

As promised, I've just updated the repo with a new name (for the sake of
clarity) and default branch, but especially with a dedicated README containing:

* explanations on how to launch and use it
* an intro on each feature like Spark, Classpaths, SQL, Dynamic update, ...
* pictures showing results

There is a notebook for each feature, so it's easier to try out!

Here is the repo:
https://github.com/andypetrella/spark-notebook/

HTH and PRs are more than welcome ;-).


aℕdy ℙetrella
about.me/noootsab
[image: aℕdy ℙetrella on about.me]

http://about.me/noootsab

On Wed, Oct 8, 2014 at 4:57 PM, Michael Allman mich...@videoamp.com wrote:

 Hi Andy,

 This sounds awesome. Please keep us posted. Meanwhile, can you share a
 link to your project? I wasn't able to find it.

 Cheers,

 Michael

 On Oct 8, 2014, at 3:38 AM, andy petrella andy.petre...@gmail.com wrote:

 Heya

 You can check Zeppelin or my fork of the Scala Notebook.
 I'm going this weekend to push some effort into the docs, because it
 supports realtime graphing, Scala, SQL, dynamic loading of dependencies,
 and I started this morning a widget to track the progress of the jobs.
 I'm quite happy with it so far; I used it with GraphX, MLlib, ADAM and the
 Cassandra connector.
 However, its major drawback is that it is a one man (best) effort ftm! :-S
 On Oct 8, 2014 11:16, Dai, Kevin yun...@ebay.com wrote:

  Hi, All



 We need an interactive interface tool for Spark in which we can run Spark
 jobs and plot graphs to explore the data interactively.

 IPython Notebook is good, but it only supports Python (we want one
 supporting Scala)…



 BR,

 Kevin.









Re: Interactive interface tool for spark

2014-10-12 Thread andy petrella
Yeah, if it allows you to craft some Scala/Spark code in a shareable manner, it
is another good option!

thx for sharing

aℕdy ℙetrella
about.me/noootsab
[image: aℕdy ℙetrella on about.me]

http://about.me/noootsab

On Sun, Oct 12, 2014 at 9:47 PM, Jaonary Rabarisoa jaon...@gmail.com
wrote:

 And what about Hue http://gethue.com ?

 On Sun, Oct 12, 2014 at 1:26 PM, andy petrella andy.petre...@gmail.com
 wrote:

 Dear Sparkers,

 As promised, I've just updated the repo with a new name (for the sake of
 clarity), default branch but specially with a dedicated README containing:

 * explanations on how to launch and use it
 * an intro on each feature like Spark, Classpaths, SQL, Dynamic update,
 ...
 * pictures showing results

 There is a notebook for each feature, so it's easier to try out!

 Here is the repo:
 https://github.com/andypetrella/spark-notebook/

 HTH and PRs are more than welcome ;-).


 aℕdy ℙetrella
 about.me/noootsab
 [image: aℕdy ℙetrella on about.me]

 http://about.me/noootsab

 On Wed, Oct 8, 2014 at 4:57 PM, Michael Allman mich...@videoamp.com
 wrote:

 Hi Andy,

 This sounds awesome. Please keep us posted. Meanwhile, can you share a
 link to your project? I wasn't able to find it.

 Cheers,

 Michael

 On Oct 8, 2014, at 3:38 AM, andy petrella andy.petre...@gmail.com
 wrote:

 Heya

 You can check Zeppellin or my fork of the Scala notebook.
 I'm going this week end to push some efforts on the doc, because it
 supports for realtime graphing, Scala, SQL, dynamic loading of dependencies
 and I started this morning a widget to track the progress of the jobs.
 I'm quite happy with it so far, I used it with graphx, mllib, ADAM and
 the Cassandra connector so far.
 However, its major drawback is that it is a one man (best) effort ftm!
 :-S
  On Oct 8, 2014 11:16, Dai, Kevin yun...@ebay.com wrote:

  Hi, All



 We need an interactive interface tool for spark in which we can run
 spark job and plot graph to explorer the data interactively.

 Ipython notebook is good, but it only support python (we want one
 supporting scala)…



 BR,

 Kevin.











Re: Interactive interface tool for spark

2014-10-09 Thread andy petrella
Hey,

Regarding python libs, I'd say it's not supported out of the box, however
it must be quite easy to generate plots using jFreeChart and automatically
add 'em to the DOM.
Nevertheless, I added extensible support for JavaScript manipulation of
results; using that, it's rather easy to plot lines, scatterplots, etc.
with D3.js (via Scala wrappers, or JavaScript directly), and there is small
support for Rickshaw to display timeseries.

aℕdy ℙetrella
about.me/noootsab
[image: aℕdy ℙetrella on about.me]

http://about.me/noootsab

On Thu, Oct 9, 2014 at 12:43 AM, Kelvin Chu 2dot7kel...@gmail.com wrote:

 Hi Andy,

 It sounds great! Quick questions: I have been using IPython + PySpark. I
 crunch the data by PySpark and then visualize the data by Python libraries
 like matplotlib and basemap. Could I still use these Python libraries in
 the Scala Notebook? If not, what is suggested approaches for visualization
 there? Thanks.

 Kelvin

 On Wed, Oct 8, 2014 at 9:14 AM, andy petrella andy.petre...@gmail.com
 wrote:

 Sure! I'll post updates as well in the ML :-)
 I'm doing it on twitter for now (until doc is ready).

 The repo is there (branch spark) :
 https://github.com/andypetrella/scala-notebook/tree/spark

 Some tweets:
 * very first working stuff:
 https://twitter.com/noootsab/status/508758335982927872/photo/1
 * using graphx:
 https://twitter.com/noootsab/status/517073481104908289/photo/1
 * using sql (it has already evolved in order to declare variable names):
 https://twitter.com/noootsab/status/518917295226515456/photo/1
 * using ADAM+mllib:
 https://twitter.com/noootsab/status/511270449054220288/photo/1

 There are plenty of others stuffs but will need some time for the
 write-up (soon)


 cheers,
 andy

 aℕdy ℙetrella
 about.me/noootsab
 [image: aℕdy ℙetrella on about.me]

 http://about.me/noootsab

 On Wed, Oct 8, 2014 at 4:57 PM, Michael Allman mich...@videoamp.com
 wrote:

 Hi Andy,

 This sounds awesome. Please keep us posted. Meanwhile, can you share a
 link to your project? I wasn't able to find it.

 Cheers,

 Michael

 On Oct 8, 2014, at 3:38 AM, andy petrella andy.petre...@gmail.com
 wrote:

 Heya

 You can check Zeppellin or my fork of the Scala notebook.
 I'm going this week end to push some efforts on the doc, because it
 supports for realtime graphing, Scala, SQL, dynamic loading of dependencies
 and I started this morning a widget to track the progress of the jobs.
 I'm quite happy with it so far, I used it with graphx, mllib, ADAM and
 the Cassandra connector so far.
 However, its major drawback is that it is a one man (best) effort ftm!
 :-S
  On Oct 8, 2014 11:16, Dai, Kevin yun...@ebay.com wrote:

  Hi, All



 We need an interactive interface tool for spark in which we can run
 spark job and plot graph to explorer the data interactively.

 Ipython notebook is good, but it only support python (we want one
 supporting scala)…



 BR,

 Kevin.











Re: Interactive interface tool for spark

2014-10-08 Thread andy petrella
Sure! I'll post updates as well in the ML :-)
I'm doing it on twitter for now (until doc is ready).

The repo is there (branch spark) :
https://github.com/andypetrella/scala-notebook/tree/spark

Some tweets:
* very first working stuff:
https://twitter.com/noootsab/status/508758335982927872/photo/1
* using graphx:
https://twitter.com/noootsab/status/517073481104908289/photo/1
* using sql (it has already evolved in order to declare variable names):
https://twitter.com/noootsab/status/518917295226515456/photo/1
* using ADAM+mllib:
https://twitter.com/noootsab/status/511270449054220288/photo/1

There are plenty of others stuffs but will need some time for the write-up
(soon)


cheers,
andy

aℕdy ℙetrella
about.me/noootsab
[image: aℕdy ℙetrella on about.me]

http://about.me/noootsab

On Wed, Oct 8, 2014 at 4:57 PM, Michael Allman mich...@videoamp.com wrote:

 Hi Andy,

 This sounds awesome. Please keep us posted. Meanwhile, can you share a
 link to your project? I wasn't able to find it.

 Cheers,

 Michael

 On Oct 8, 2014, at 3:38 AM, andy petrella andy.petre...@gmail.com wrote:

 Heya

 You can check Zeppelin or my fork of the Scala Notebook.
 I'm going this weekend to push some effort into the docs, because it
 supports realtime graphing, Scala, SQL, dynamic loading of dependencies,
 and I started this morning a widget to track the progress of the jobs.
 I'm quite happy with it so far; I used it with GraphX, MLlib, ADAM and the
 Cassandra connector.
 However, its major drawback is that it is a one man (best) effort ftm! :-S
  On Oct 8, 2014 11:16, Dai, Kevin yun...@ebay.com wrote:

  Hi, All



  We need an interactive interface tool for Spark in which we can run Spark
  jobs and plot graphs to explore the data interactively.

  IPython Notebook is good, but it only supports Python (we want one
  supporting Scala)…



 BR,

 Kevin.









Re: GraphX: Types for the Nodes and Edges

2014-10-01 Thread andy petrella
I'll try my best ;-).

1/ You could create an abstract type for the node types and another one for the
edge types, then use the subclasses as payloads in your VertexRDD or in your
Edge. Regarding storage and files, it doesn't really matter (unless you want to
use the OOTB loading methods, in which case you need to follow their
conventions). What could be done is to have one file per type, load them all in
Spark, and then union them into a single vertex RDD and a single edge RDD (see
the sketch just below).
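For point 1/, a minimal sketch (the type and field names below are invented for
the example, not taken from the original question) could look like this:

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// One sealed base type per side, one RDD per concrete type, then a union.
sealed trait VProp
case class Author(name: String)  extends VProp
case class Paper(title: String)  extends VProp

sealed trait EProp
case class Wrote(year: Int)      extends EProp
case class Cites(weight: Double) extends EProp

def buildGraph(authors: RDD[(VertexId, Author)],
               papers:  RDD[(VertexId, Paper)],
               wrote:   RDD[Edge[Wrote]],
               cites:   RDD[Edge[Cites]]): Graph[VProp, EProp] = {
  // RDD is invariant, so widen the payloads explicitly before the union.
  val vertices = authors.map { case (id, a) => (id, a: VProp) } union
                 papers.map  { case (id, p) => (id, p: VProp) }
  val edges    = wrote.map(e => Edge(e.srcId, e.dstId, e.attr: EProp)) union
                 cites.map(e => Edge(e.srcId, e.dstId, e.attr: EProp))
  Graph(vertices, edges)
}
```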

2/ AFAIK there are local indices in GraphX (implicit ones per partition), so
that index lookup by VertexId is very fast. Given that, it shouldn't be
necessary to have another index. Unless you want specific lookups, in which
case you could maintain your own mapping from X → VertexId and then use the
usual `lookup`.

my2€

aℕdy ℙetrella
about.me/noootsab
[image: aℕdy ℙetrella on about.me]

http://about.me/noootsab

On Wed, Oct 1, 2014 at 4:35 PM, Oshi oshc...@gmail.com wrote:

 Hi,

 Sorry this question may be trivial. I'm new to Spark and GraphX. I need to
 create a graph that has different types of nodes(3 types) and edges(4
 types). Each type of node and edge has a different list of attributes.

 1) How should I build the graph? Should I specify all types of nodes(or
 edges) in one input file to create the vertexRDD(or edgeRDD)?
 2) Is it possible to create indices on the type of node to make searching
 faster?

 Thanks!!





 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-Types-for-the-Nodes-and-Edges-tp15486.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: REPL like interface for Spark

2014-09-29 Thread andy petrella
Heya,

I started to port the scala-notebook to Spark some weeks ago (but doing it
in my sparse time and for my Spark talks ^^). It's a WIP but works quite
fine ftm, you can check my fork and branch over here:
https://github.com/andypetrella/scala-notebook/tree/spark

Feel free to ask any questions, I'll happy to help of course (PRs are more
than welcome :-P)

Cheers,

aℕdy ℙetrella
about.me/noootsab
[image: aℕdy ℙetrella on about.me]

http://about.me/noootsab

On Mon, Sep 29, 2014 at 10:19 AM, IT CTO goi@gmail.com wrote:

 Hi,
 Does anyone know of a REPL interface for Spark on GIT which supports a similar
 user experience to the one presented by Databricks in their cloud demo?

 We are looking for something similar but one that can be deployed on
 premise and not on the cloud.

 --
 Eran | CTO



Re: REPL like interface for Spark

2014-09-29 Thread andy petrella
Cool!!! I'll give it a try ASAP!

aℕdy ℙetrella
about.me/noootsab
[image: aℕdy ℙetrella on about.me]

http://about.me/noootsab

On Mon, Sep 29, 2014 at 10:48 AM, moon soo Lee leemoon...@gmail.com wrote:

 Hi,

 There is a project called Zeppelin.

 You can check it out here:
 https://github.com/NFLabs/zeppelin

 The homepage is here:
 http://zeppelin-project.org/

 It's a notebook-style tool (like the Databricks demo or scala-notebook) with a
 nice UI and built-in Spark integration.

 It's in active development, so don't hesitate to ask questions or request
 features on the mailing list.

 Thanks.

 - moon

 On Mon, Sep 29, 2014 at 5:27 PM, andy petrella andy.petre...@gmail.com
 wrote:

 Heya,

 I started to port the scala-notebook to Spark some weeks ago (but doing
 it in my sparse time and for my Spark talks ^^). It's a WIP but works quite
 fine ftm, you can check my fork and branch over here:
 https://github.com/andypetrella/scala-notebook/tree/spark

 Feel free to ask any questions, I'll happy to help of course (PRs are
 more than welcome :-P)

 Cheers,

 aℕdy ℙetrella
 about.me/noootsab
 [image: aℕdy ℙetrella on about.me]

 http://about.me/noootsab

 On Mon, Sep 29, 2014 at 10:19 AM, IT CTO goi@gmail.com wrote:

 Hi,
 Does anyone know of a REPL interface for Spark on GIT which supports a
 similar user experience to the one presented by Databricks in their cloud demo?

 We are looking for something similar but one that can be deployed on
 premise and not on the cloud.

 --
 Eran | CTO






Re: REPL like interface for Spark

2014-09-29 Thread andy petrella
However, I must say ^^ it's funny that it has been built using the usual
plain old Java stuff :-D.

aℕdy ℙetrella
about.me/noootsab
[image: aℕdy ℙetrella on about.me]

http://about.me/noootsab

On Mon, Sep 29, 2014 at 10:51 AM, andy petrella andy.petre...@gmail.com
wrote:

 Cool!!! I'll give it a try ASAP!

 aℕdy ℙetrella
 about.me/noootsab
 [image: aℕdy ℙetrella on about.me]

 http://about.me/noootsab

 On Mon, Sep 29, 2014 at 10:48 AM, moon soo Lee leemoon...@gmail.com
 wrote:

 Hi,

 There is a project called Zeppelin.

 You can check it out here:
 https://github.com/NFLabs/zeppelin

 The homepage is here:
 http://zeppelin-project.org/

 It's a notebook-style tool (like the Databricks demo or scala-notebook) with a
 nice UI and built-in Spark integration.

 It's in active development, so don't hesitate to ask questions or request
 features on the mailing list.

 Thanks.

 - moon

 On Mon, Sep 29, 2014 at 5:27 PM, andy petrella andy.petre...@gmail.com
 wrote:

 Heya,

 I started to port the scala-notebook to Spark some weeks ago (but doing
 it in my sparse time and for my Spark talks ^^). It's a WIP but works quite
 fine ftm, you can check my fork and branch over here:
 https://github.com/andypetrella/scala-notebook/tree/spark

 Feel free to ask any questions, I'll happy to help of course (PRs are
 more than welcome :-P)

 Cheers,

 aℕdy ℙetrella
 about.me/noootsab
 [image: aℕdy ℙetrella on about.me]

 http://about.me/noootsab

 On Mon, Sep 29, 2014 at 10:19 AM, IT CTO goi@gmail.com wrote:

 Hi,
 Does anyone know of a REPL interface for Spark on GIT which supports a
 similar user experience to the one presented by Databricks in their cloud demo?

 We are looking for something similar but one that can be deployed on
 premise and not on the cloud.

 --
 Eran | CTO







Re: Example of Geoprocessing with Spark

2014-09-20 Thread andy petrella
It's probably slow, as you say, because it's actually also doing the map
phase that performs the RTree search and so on, and only then saving to HDFS
with 60 partitions. If you want to see the time spent saving to HDFS, you
could do a count, for instance, before saving. Also, saving from 60 partitions
might be overkill, so what you can do is first recoalesce to the number of
physical nodes that you have (without shuffling).

On the other hand, I don't know if you're running this on a cluster, but
geoDataMun looks rather heavy to serialize, so it would be preferable to
broadcast it once (since it won't change).
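As a minimal, self-contained sketch of that broadcast suggestion (the Map below
just stands in for the spatial index built in geoDataMun in the code quoted
below, assuming that index serializes cleanly):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-sketch").setMaster("local[*]"))

    // Build the lookup structure once on the driver...
    val lookup: Map[String, Int] = Map("a" -> 1, "b" -> 2)
    // ...broadcast it once per executor instead of capturing it in every task closure...
    val bc = sc.broadcast(lookup)

    // ...and read it inside tasks through .value.
    val enriched = sc.parallelize(Seq("a", "b", "c"))
      .map(k => (k, bc.value.getOrElse(k, 0)))

    enriched.collect().foreach(println)
    sc.stop()
  }
}
```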

Also, it might only be a syntax improvement, but the construction of (cve_est,
cve_mun) is rather long and it seems it can be replaced by these 3 lines
only:




  val (cve_est, cve_mun) = internal collectFirst {
    case (g: Geometry, e: String, m: String) if g.intersects(point) => (e, m)
  } getOrElse ("0", "0")


HTH

aℕdy ℙetrella
about.me/noootsab
[image: aℕdy ℙetrella on about.me]

http://about.me/noootsab

On Sat, Sep 20, 2014 at 5:50 AM, Abel Coronado Iruegas 
acoronadoirue...@gmail.com wrote:

 Hi Evan,

 here is an improved version, thanks for your advice. But you know, the last
 step, the saveAsTextFile, is very slow :(

 import org.apache.spark.SparkContext
 import org.apache.spark.SparkContext._
 import org.apache.spark.SparkConf
 import java.net.URL
 import java.text.SimpleDateFormat
 import com.vividsolutions.jts.geom._
 import com.vividsolutions.jts.index.strtree.STRtree
 import com.vividsolutions.jts.io._
 import org.geotools.data.FileDataStoreFinder
 import org.geotools.geometry.jts.{JTSFactoryFinder, ReferencedEnvelope}
 import org.opengis.feature.{simple, Feature, FeatureVisitor}
 import scala.collection.JavaConverters._


 object SimpleApp {
   def main(args: Array[String]) {
     val conf = new SparkConf().setAppName("Csv Clipper")
     val sc = new SparkContext(conf)
     val csvPath = "hdfs://m01/user/acoronado/mov/movilidad.csv"
     val csv = sc.textFile(csvPath)
     //csv.coalesce(60, true)
     csv.cache()
     val clipPoints = csv.map({ line: String =>
       val Array(usuario, lat, lon, date) = line.split(",").map(_.trim)
       val geometryFactory = JTSFactoryFinder.getGeometryFactory()
       val reader = new WKTReader(geometryFactory)
       val point = reader.read("POINT (" + lon + " " + lat + ")")
       val envelope = point.getEnvelopeInternal
       val internal = geoDataMun.get(envelope)
       val (cve_est, cve_mun) = internal match {
         case l => {
           val existe = l.find(f => f match {
             case (g: Geometry, e: String, m: String) => g.intersects(point)
             case _ => false
           })
           existe match {
             case Some(t) => t match {
               case (g: Geometry, e: String, m: String) => (e, m)
               case _ => ("0", "0")
             }
             case None => ("0", "0")
           }
         }
         case _ => ("0", "0")
       }
       val time = try {
         (new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSZ")).parse(date.replaceAll("Z$", "+0000")).getTime().toString()
       } catch { case e: Exception => 0 }
       line + "," + time + "," + cve_est + "," + cve_mun
     })

     clipPoints.saveAsTextFile("hdfs://m01/user/acoronado/mov/resultados_movilidad_fast.csv")
   }

   object geoDataMun {
     var spatialIndex = new STRtree()
     val path = new URL("file:geoData/MunicipiosLatLon.shp")
     val store = FileDataStoreFinder.getDataStore(path)
     val source = store.getFeatureSource()
     val features = source.getFeatures()
     val it = features.features()
     while (it.hasNext) {
       val feature = it.next()
       val geom = feature.getDefaultGeometry
       if (geom != null) {
         val geomClass = geom match {
           case g2: Geometry => g2
           case _ => throw new ClassCastException
         }
         val env = geomClass.getEnvelopeInternal()
         if (!env.isNull) {
           spatialIndex.insert(env, (geomClass, feature.getAttribute(1), feature.getAttribute(2)))
         }
       }
     }
     def get(env: Envelope) =
 

Re: Serving data

2014-09-15 Thread andy petrella
I'm using Parquet in ADAM, and I can say that it works pretty fine!
Enjoy ;-)

aℕdy ℙetrella
about.me/noootsab
[image: aℕdy ℙetrella on about.me]

http://about.me/noootsab

On Mon, Sep 15, 2014 at 1:41 PM, Marius Soutier mps@gmail.com wrote:

 Thank you guys, I’ll try Parquet and if that’s not quick enough I’ll go
 the usual route with either read-only or normal database.

 On 13.09.2014, at 12:45, andy petrella andy.petre...@gmail.com wrote:

 however, the cache is not guaranteed to remain, if other jobs are launched
 in the cluster and require more memory than what's left in the overall
 caching memory, previous RDDs will be discarded.

 Using an off heap cache like tachyon as a dump repo can help.

 In general, I'd say that using a persistent sink (like Cassandra for
 instance) is best.

 my .2¢


 aℕdy ℙetrella
 about.me/noootsab
 [image: aℕdy ℙetrella on about.me]

 http://about.me/noootsab

 On Sat, Sep 13, 2014 at 9:20 AM, Mayur Rustagi mayur.rust...@gmail.com
 wrote:

 You can cache data in memory & query it using Spark Job Server.
 Most folks dump data down to a queue/db for retrieval.
 You can batch up data & store it into Parquet partitions as well, & query it
 using another SparkSQL shell; the JDBC driver in SparkSQL is part of 1.1, I
 believe.
 --
 Regards,
 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi


 On Fri, Sep 12, 2014 at 2:54 PM, Marius Soutier mps@gmail.com
 wrote:

 Hi there,

 I’m pretty new to Spark, and so far I’ve written my jobs the same way I
 wrote Scalding jobs - one-off, read data from HDFS, count words, write
 counts back to HDFS.

 Now I want to display these counts in a dashboard. Since Spark allows to
 cache RDDs in-memory and you have to explicitly terminate your app (and
 there’s even a new JDBC server in 1.1), I’m assuming it’s possible to keep
 an app running indefinitely and query an in-memory RDD from the outside
 (via SparkSQL for example).

 Is this how others are using Spark? Or are you just dumping job results
 into message queues or databases?


 Thanks
 - Marius


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org







Re: Serving data

2014-09-13 Thread andy petrella
however, the cache is not guaranteed to remain, if other jobs are launched
in the cluster and require more memory than what's left in the overall
caching memory, previous RDDs will be discarded.

Using an off heap cache like tachyon as a dump repo can help.

In general, I'd say that using a persistent sink (like Cassandra for
instance) is best.

my .2¢
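A small sketch of that trade-off (Spark 1.x era APIs; the paths below are
placeholders): persist with a disk fallback so eviction doesn't silently drop
the data, and/or write the result to a durable sink the dashboard can read.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object ServingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("serving-sketch").setMaster("local[*]"))

    val counts = sc.textFile("hdfs:///input/words")        // placeholder input path
      .flatMap(_.split("\\s+"))
      .map((_, 1L))
      .reduceByKey(_ + _)

    counts.persist(StorageLevel.MEMORY_AND_DISK)           // evicted blocks spill to disk instead of vanishing
    counts.saveAsTextFile("hdfs:///output/word-counts")    // durable copy a dashboard/query layer can serve from
    sc.stop()
  }
}
```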


aℕdy ℙetrella
about.me/noootsab
[image: aℕdy ℙetrella on about.me]

http://about.me/noootsab

On Sat, Sep 13, 2014 at 9:20 AM, Mayur Rustagi mayur.rust...@gmail.com
wrote:

 You can cache data in memory & query it using Spark Job Server.
 Most folks dump data down to a queue/db for retrieval.
 You can batch up data & store it into Parquet partitions as well, & query it
 using another SparkSQL shell; the JDBC driver in SparkSQL is part of 1.1, I
 believe.
 --
 Regards,
 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi


 On Fri, Sep 12, 2014 at 2:54 PM, Marius Soutier mps@gmail.com wrote:

 Hi there,

 I’m pretty new to Spark, and so far I’ve written my jobs the same way I
 wrote Scalding jobs - one-off, read data from HDFS, count words, write
 counts back to HDFS.

 Now I want to display these counts in a dashboard. Since Spark allows to
 cache RDDs in-memory and you have to explicitly terminate your app (and
 there’s even a new JDBC server in 1.1), I’m assuming it’s possible to keep
 an app running indefinitely and query an in-memory RDD from the outside
 (via SparkSQL for example).

 Is this how others are using Spark? Or are you just dumping job results
 into message queues or databases?


 Thanks
 - Marius


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: New sbt plugin to deploy jobs to EC2

2014-09-05 Thread andy petrella
\o/ = will test it soon or sooner, gr8 idea btw

aℕdy ℙetrella
about.me/noootsab
[image: aℕdy ℙetrella on about.me]

http://about.me/noootsab


On Fri, Sep 5, 2014 at 12:37 PM, Felix Garcia Borrego fborr...@gilt.com
wrote:

 As far as I know, in order to deploy and execute jobs on EC2 you need to
 assemble your project, copy your jar onto the cluster, log in using ssh
 and submit the job.

 To avoid having to do this I've been prototyping an sbt plugin(1) that
 allows you to create and send Spark jobs to an Amazon EC2 cluster directly from
 your local machine using sbt.

 It's a simple plugin that actually relies on spark-ec2 and spark-submit, but
 I'd like to have feedback and see if this plugin makes any sense before
 going ahead with the final impl, or if there is any other easy way to do so.

 (1) https://github.com/felixgborrego/sbt-spark-ec2-plugin

 Thanks,





Re: creating a distributed index

2014-08-01 Thread andy petrella
Hey,
There is some work that started on IndexedRDD (on master I think).
Meanwhile, checking what has been done in GraphX regarding vertex index in
partitions could be worthwhile I guess
Hth
Andy
Le 1 août 2014 22:50, Philip Ogren philip.og...@oracle.com a écrit :


 Suppose I want to take my large text data input and create a distributed
 inverted index in Spark on each string in the input (imagine an in-memory
  lucene index - not what I'm doing, but it's analogous).  It seems that I
 could do this with mapPartition so that each element in a partition gets
 added to an index for that partition.  I'm making the simplifying
 assumption that the individual indexes do not need to coordinate any global
 metrics so that e.g. tf-idf scores are consistent across these indexes.
  Would it then be possible to take a string and query each partition's
 index with it?  Or better yet, take a batch of strings and query each
 string in the batch against each partition's index?

 Thanks,
 Philip




Re: iScala or Scala-notebook

2014-07-29 Thread andy petrella
Some people started some work on that topic using the notebook (the
original or the n8han one, I cannot remember)... Some issues have been created
already ^^
Le 29 juil. 2014 19:59, Nick Pentreath nick.pentre...@gmail.com a
écrit :

 IScala itself seems to be a bit dead unfortunately.

 I did come across this today: https://github.com/tribbloid/ISpark


 On Fri, Jul 18, 2014 at 4:59 AM, ericjohnston1989 
 ericjohnston1...@gmail.com wrote:

 Hey everyone,

 I know this was asked before but I'm wondering if there have since been
 any
 updates. Are there any plans to integrate iScala/Scala-notebook with spark
 in the near future?

 This seems like something a lot of people would find very useful, so I was
 just wondering if anyone has started working on it.

 Thanks,

 Eric



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/iScala-or-Scala-notebook-tp10127.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





Re: Debugging Task not serializable

2014-07-28 Thread andy petrella
Also check the guides for the JVM option that prints messages for such
problems.
Sorry, sent from phone and don't know it by heart :/
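(The JVM option meant here is most likely the standard extended serialization
debug switch; as a hedged, sbt-style sketch for a locally forked run:)

```scala
// Makes NotSerializableException report the reference chain that pulled the
// offending object into the closure (javaOptions only apply to forked JVMs in sbt).
fork := true
javaOptions += "-Dsun.io.serialization.extendedDebugInfo=true"
```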
Le 28 juil. 2014 18:44, Akhil Das ak...@sigmoidanalytics.com a écrit :

 A quick fix would be to implement java.io.Serializable in those classes
 which are causing this exception.



 Thanks
 Best Regards


 On Mon, Jul 28, 2014 at 9:21 PM, Juan Rodríguez Hortalá 
 juan.rodriguez.hort...@gmail.com wrote:

 Hi all,

 I was wondering if someone has conceived a method for debugging Task not
 serializable: java.io.NotSerializableException errors, apart from
 commenting and uncommenting parts of the program, or just turning
 everything into Serializable. I find this kind of error very hard to debug,
 as these are originated in the Spark runtime system.

 I'm using Spark for Java.

 Thanks a lot in advance,

 Juan





Re: relationship of RDD[Array[String]] to Array[Array[String]]

2014-07-21 Thread andy petrella
heya,

Without a bit of gymnastics at the type level, nope. Actually RDD doesn't
share any functions with the Scala collections library (the simple reason I can
see is that Spark's are lazy, while the default implementations in Scala
aren't).

However, it'd be possible by implementing an implicit converter from a
SeqLike (for instance) to an RDD; nonetheless it'd be cumbersome because the
overlap between the two worlds isn't complete (for instance, flatMap doesn't
have the same semantics, drop is hard, etc.).

Also, it'd scare me a bit to have this kind of bazooka waiting around the next
corner, letting me think that an iterative-like process can be run in a
distributed world :-).

OTOH, the inverse is quite easy: an implicit conversion from an RDD to an Array
is simply a call to collect (take care that RDD is not covariant -- I think
that's related to the fact that the ClassTag is needed!?).

only my .2 ¢
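The "easy direction" mentioned above, as a tiny sketch (use with care:
collect() pulls the whole dataset into driver memory):

```scala
import org.apache.spark.rdd.RDD
import scala.language.implicitConversions

object RddAsLocal {
  // Materialize the RDD on the driver so local-collection code can consume it.
  implicit def rddToSeq[A](rdd: RDD[A]): Seq[A] = rdd.collect().toSeq
}
```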




 aℕdy ℙetrella
about.me/noootsab
[image: aℕdy ℙetrella on about.me]

http://about.me/noootsab


On Mon, Jul 21, 2014 at 7:01 PM, Philip Ogren philip.og...@oracle.com
wrote:

 It is really nice that Spark RDD's provide functions  that are often
 equivalent to functions found in Scala collections.  For example, I can
 call:

 myArray.map(myFx)

 and equivalently

 myRdd.map(myFx)

 Awesome!

 My question is this.  Is it possible to write code that works on either an
 RDD or a local collection without having to have parallel implementations?
  I can't tell that RDD or Array share any supertypes or traits by looking
 at the respective scaladocs. Perhaps implicit conversions could be used
 here.  What I would like to do is have a single function whose body is like
 this:

 myData.map(myFx)

 where myData could be an RDD[Array[String]] (for example) or an
 Array[Array[String]].

 Has anyone had success doing this?

 Thanks,
 Philip





Re: Generic Interface between RDD and DStream

2014-07-11 Thread andy petrella
A while ago, I wrote this:
```

package com.virdata.core.compute.common.api

import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.StreamingContext

sealed trait SparkEnvironment extends Serializable {
  type Context

  type Wagon[A]
}
object Batch extends SparkEnvironment {
  type Context = SparkContext
  type Wagon[A] = RDD[A]
}
object Streaming extends SparkEnvironment{
  type Context = StreamingContext
  type Wagon[A] = DStream[A]
}

```
Then I can produce code like this (just an example)

```

package com.virdata.core.compute.common.api

import org.apache.spark.Logging

trait Process[M[_], In, N[_], Out, E <: SparkEnvironment] extends Logging { self =>

  def run(in: M[E#Wagon[In]])(implicit context: E#Context): N[E#Wagon[Out]]

  def pipe[Q[_], U](follow: Process[N, Out, Q, U, E]): Process[M, In, Q, U, E] =
    new Process[M, In, Q, U, E] {
      override def run(in: M[E#Wagon[In]])(implicit context: E#Context): Q[E#Wagon[U]] = {
        val run1: N[E#Wagon[Out]] = self.run(in)
        follow.run(run1)
      }
    }
}

```

It doesn't solve the whole thing, because we'll still have to duplicate the
code (for Batch and Streaming).
However, when the common traits are there, I'll only have to remove half of
the implementations -- without touching the calling side (using them),
and thus keeping my plain old backward compat' ^^.

I know it's just an intermediate hack, but still ;-)
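For what it's worth, a rough usage sketch (the names below are invented, and it
assumes the definitions above compile as posted): each step is still
implemented once per environment, but callers only ever see the generic
Process type.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Batch-side implementation of one step; a Streaming twin would do the same
// against DStream until a common trait exists.
object BatchTokenize extends Process[List, String, List, Array[String], Batch.type] {
  override def run(in: List[RDD[String]])(implicit sc: SparkContext): List[RDD[Array[String]]] =
    in.map(_.map(_.split(" ")))
}
```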

greetz,


  aℕdy ℙetrella
about.me/noootsab
[image: aℕdy ℙetrella on about.me]

http://about.me/noootsab


On Sat, Jul 12, 2014 at 12:57 AM, Tathagata Das tathagata.das1...@gmail.com
 wrote:

 I totally agree that if we are able to do this it will be very cool.
 However, this requires having a common trait / interface between RDD and
 DStream, which we don't have as of now. It would be very cool though. On my
 wish list for sure.

 TD


 On Thu, Jul 10, 2014 at 11:53 AM, mshah shahmaul...@gmail.com wrote:

 I wanted to get a perspective on how to share code between Spark batch
 processing and Spark Streaming.

 For example, I want to get unique tweets stored in a HDFS file then in
 both
 Spark Batch and Spark Streaming. Currently I will have to do following
 thing:

 Tweet {
   String tweetText;
   String userId;
 }

 Spark Batch:
 tweets = sparkContext.newHadoopApiAsFile("tweet");

 def getUniqueTweets(tweets: RDD[Tweet]) = {
   tweets.map(tweet => (tweetText, tweet)).groupByKey(tweetText).map((tweetText, _) => tweetText)
 }

 Spark Streaming:

 tweets = streamingContext.fileStream("tweet");

 def getUniqueTweets(tweets: DStream[Tweet]) = {
   tweets.map(tweet => (tweetText, tweet)).groupByKey(tweetText).map((tweetText, _) => tweetText)
 }


 The above example shows I am doing the same thing, but I have to replicate the
 code as there is no common abstraction between DStream and RDD, or between
 SparkContext and StreamingContext.

 If there was a common abstraction it would have been much simpler:

 tweets = context.read("tweet", Stream or Batch)

 def getUniqueTweets(tweets: SparkObject[Tweet]) = {
   tweets.map(tweet => (tweetText, tweet)).groupByKey(tweetText).map((tweetText, _) => tweetText)
 }

 I would appreciate thoughts on it. Is it already available? Is there any
 plan to add this support? Is it intentionally not supported because of
 design choice?





 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Generic-Interface-between-RDD-and-DStream-tp9331.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





Re: Spark and RDF

2014-06-20 Thread andy petrella
Maybe some SPARQL features in Shark, then ?

 aℕdy ℙetrella
about.me/noootsab
[image: aℕdy ℙetrella on about.me]

http://about.me/noootsab


On Fri, Jun 20, 2014 at 9:45 PM, Mayur Rustagi mayur.rust...@gmail.com
wrote:

 You are looking to create Shark operators for RDF? Since the Shark backend is
 shifting to SparkSQL that would be slightly hard; a much better effort would
 be to shift Gremlin to Spark (though a much beefier one :) )

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Fri, Jun 20, 2014 at 3:39 PM, andy petrella andy.petre...@gmail.com
 wrote:

 For RDF, might GraphX be particularly appropriate?

  aℕdy ℙetrella
 about.me/noootsab
 [image: aℕdy ℙetrella on about.me]

 http://about.me/noootsab


 On Thu, Jun 19, 2014 at 4:49 PM, Flavio Pompermaier pomperma...@okkam.it
  wrote:

 Hi guys,

 I'm analyzing the possibility to use Spark to analyze RDF files and
 define reusable Shark operators on them (custom filtering, transforming,
 aggregating, etc). Is that possible? Any hint?

 Best,
 Flavio






Re: Use SparkListener to get overall progress of an action

2014-05-22 Thread andy petrella
SparkListener offers good stuff.
But I also complemented it with my own metrics machinery that uses Akka
to aggregate metrics from anywhere I'd like to collect them (without any
dependency on Ganglia, only on Codahale).
However, this was useful to gather some custom metrics (from within the
tasks, then), not really to collect overall monitoring information about the
Spark internals themselves.
For that, the Spark UI already offers pretty good insight, no?

Cheers,

 aℕdy ℙetrella
about.me/noootsab
[image: aℕdy ℙetrella on about.me]

http://about.me/noootsab


On Thu, May 22, 2014 at 4:51 PM, Pierre B 
pierre.borckm...@realimpactanalytics.com wrote:

 Is there a simple way to monitor the overall progress of an action using
 SparkListener or anything else?

 I see that one can name an RDD... Could that be used to determine which
 action triggered a stage, ... ?


 Thanks

 Pierre



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Use-SparkListener-to-get-overall-progress-of-an-action-tp6256.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Use SparkListener to get overall progress of an action

2014-05-22 Thread andy petrella
Yeah, actually for that I used Codahale directly, with my own machinery built
on the Akka system from within Spark itself.

So the workers send messages back to a bunch of actors on the driver, which
use Codahale metrics.
This way I can collect what/how an executor does/did, but I can also
aggregate all executors' metrics at once (via dedicated aggregation-purposed
Codahale metrics).

However, I didn't have time to dig enough into Spark to see whether I could
reuse the SparkListener system itself -- which is kind of doing the same thing,
but w/o Akka AFAICT = where I can see that TaskMetrics are collected per
task within the context/granularity of a Stage. Then aggregation looks like it
is done in a built-in (Queued) Bus. So I'll let someone else report how
this could be extended, but my gut feeling is that it won't be straightforward.
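For what it's worth, a minimal sketch of hooking into that listener bus for
coarse progress; the event classes carry different fields across Spark
versions, so this only counts the callbacks themselves:

```scala
import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Counts finished tasks across all jobs; register it on the driver with
// sc.addSparkListener(new ProgressListener()).
class ProgressListener extends SparkListener {
  private val finished = new AtomicInteger(0)
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
    println(s"tasks finished so far: ${finished.incrementAndGet()}")
}
```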

hth (with respect of my limited knowledge of these internals ^^)

cheers,


  aℕdy ℙetrella
about.me/noootsab
[image: aℕdy ℙetrella on about.me]

http://about.me/noootsab


On Thu, May 22, 2014 at 5:02 PM, Pierre B 
pierre.borckm...@realimpactanalytics.com wrote:

 Hi Andy!

 Yes, the Spark UI provides a lot of interesting information for debugging
 purposes.

 Here I'm trying to integrate simple progress monitoring in my app's UI.

 I'm typically running a few “jobs” (or rather actions), and I'd like to be
 able to display the progress of each of those in my UI.

 I don't really see how I could do that using SparkListener for the moment …

 Thanks for your help!

 Cheers!




   *Pierre Borckmans*
 Software team

 *Real**Impact* Analytics *| *Brussels Office
  www.realimpactanalytics.com *| *[hidden email]

 *FR *+32 485 91 87 31 *| **Skype* pierre.borckmans






 On 22 May 2014, at 16:58, andy petrella [via Apache Spark User List] [hidden email] wrote:

 SparkListener offers good stuffs.
 But I also completed it with another metrics stuffs on my own that use
 Akka to aggregate metrics from anywhere I'd like to collect them (without
 any deps on ganglia yet on Codahale).
 However, this was useful to gather some custom metrics (from within the
 tasks then) not really to collect overall monitoring information about the
 spark thingies themselves.
 For that Spark UI offers already a pretty good insight no?

 Cheers,

  aℕdy ℙetrella
 about.me/noootsab
 [image: aℕdy ℙetrella on about.me]

 http://about.me/noootsab


 On Thu, May 22, 2014 at 4:51 PM, Pierre B [hidden email] wrote:

 Is there a simple way to monitor the overall progress of an action using
 SparkListener or anything else?

 I see that one can name an RDD... Could that be used to determine which
 action triggered a stage, ... ?


 Thanks

 Pierre



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Use-SparkListener-to-get-overall-progress-of-an-action-tp6256.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.







 --
  View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Use-SparkListener-to-get-overall-progress-of-an-action-tp6256p6259.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Spark 0.9.1 core dumps on Mesos 0.18.0

2014-04-17 Thread andy petrella
Heya,

I still have to try it myself (I'm trying to create GCE images with Spark
on Mesos 0.18.0) but I think your change is one of the required ones,
however my gut feeling is that others will be required to have this working.

Actually, in my understanding, this core dump is due to protobuf
incompatibilities between 2.4.1 and 2.5.0. So I guess that we should also
recompile or update or ... these projects:
  "org.spark-project.akka" %% "akka-remote" % "2.2.3-shaded-protobuf" excludeAll(excludeNetty),
  "org.spark-project.akka" %% "akka-slf4j"  % "2.2.3-shaded-protobuf" excludeAll(excludeNetty),
because they will come with:
  "org.spark-project.protobuf" %% "protobuf-java" % "2.4.1-shaded"

And maybe, to force the whole thing, add this dependency:
  "com.google.protobuf" % "protobuf-java" % "2.5.0"
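In sbt 0.13 syntax, forcing that version across the whole dependency graph
would be something like the line below (a sketch; whether it is sufficient
depends on the shaded akka artifacts above):

```scala
dependencyOverrides += "com.google.protobuf" % "protobuf-java" % "2.5.0"
```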

That's only my 0.02c, to try to move things forward until someone
from the Spark team steps in with clearer and stronger advice...

kr,


Andy Petrella
Belgium (Liège)

*   *
 Data Engineer in *NextLab http://nextlab.be/ sprl* (owner)
 Engaged Citizen Coder for *WAJUG http://wajug.be/* (co-founder)
 Author of *Learning Play! Framework 2
http://www.packtpub.com/learning-play-framework-2/book*
 Bio: on visify https://www.vizify.com/es/52c3feec2163aa0010001eaa

On Thu, Apr 17, 2014 at 8:29 PM, Steven Cox s...@renci.org wrote:

  So I tried a fix found on the list...

The issue was due to meos version mismatch as I am using latest mesos 
 0.17.0,
 but spark uses 0.13.0.
 Fixed by updating the SparkBuild.scala to latest version.

 I changed this line in SparkBuild.scala
 org.apache.mesos % mesos% 0.13.0,
 to
 org.apache.mesos % mesos% 0.18.0,

 ...ran make-distribution.sh, repackaged and redeployed the tar.gz to HDFS.

 It still core dumps like this:
 https://gist.github.com/stevencox/11002498

 In this environment:
   Ubuntu 13.10
   Mesos 0.18.0
   Spark 0.9.1
   JDK 1.7.0_45
   Scala 2.10.1

 What am I missing?



Re: Spark 0.9.1 core dumps on Mesos 0.18.0

2014-04-17 Thread andy petrella
No, of course not, but I was guessing that some native libs imported (to
communicate with Mesos) in the project could miserably crash the JVM.

Anyway, so you're telling us that with this Oracle version you don't have any
issues when using Spark on Mesos 0.18.0; that's interesting because, AFAIR, in
my last test (done by night, which means a hazy and unreliable memory) I was
using this particular version as well.

Just to make things clear, Sean: you're using Spark 0.9.1 on Mesos 0.18.0
with Hadoop 2.x (x >= 2) without any modification other than specifying
which version of Hadoop to run make-distribution against?

Thanks for your help,

Andy

On Thu, Apr 17, 2014 at 9:11 PM, Sean Owen so...@cloudera.com wrote:

 I don't know if it's anything you or the project is missing... that's
 just a JDK bug.
 FWIW I am on 1.7.0_51 and have not seen anything like that.

 I don't think it's a protobuf issue -- you don't crash the JVM with
 simple version incompatibilities :)
 --
 Sean Owen | Director, Data Science | London


 On Thu, Apr 17, 2014 at 7:29 PM, Steven Cox s...@renci.org wrote:
  So I tried a fix found on the list...
 
 The issue was due to meos version mismatch as I am using latest mesos
  0.17.0, but spark uses 0.13.0.
  Fixed by updating the SparkBuild.scala to latest version.
 
  I changed this line in SparkBuild.scala
  org.apache.mesos % mesos% 0.13.0,
  to
  org.apache.mesos % mesos% 0.18.0,
 
  ...ran make-distribution.sh, repackaged and redeployed the tar.gz to
 HDFS.
 
  It still core dumps like this:
  https://gist.github.com/stevencox/11002498
 
  In this environment:
Ubuntu 13.10
Mesos 0.18.0
Spark 0.9.1
JDK 1.7.0_45
Scala 2.10.1
 
  What am I missing?



Re: Spark 0.9.1 core dumps on Mesos 0.18.0

2014-04-17 Thread andy petrella
If you can test it quickly, an option would be to try with the exact same
version that Sean used (1.7.0_51) ?

Maybe it was a bug fixed in 51 and a regression has been introduced in 55
:-D

Andy
On Thu, Apr 17, 2014 at 9:36 PM, Steven Cox s...@renci.org wrote:

  FYI, I've tried older versions (jdk6.x), openjdk. Also here's a fresh
 core dump on jdk7u55-b13:

  # A fatal error has been detected by the Java Runtime Environment:

 #

 #  SIGSEGV (0xb) at pc=0x7f7c6b718d39, pid=7708, tid=140171900581632

 #

 # JRE version: Java(TM) SE Runtime Environment (7.0_55-b13) (build
 1.7.0_55-b13)

 # Java VM: Java HotSpot(TM) 64-Bit Server VM (24.55-b03 mixed mode
 linux-amd64 compressed oops)

 # Problematic frame:

 # V  [libjvm.so+0x632d39]  jni_GetByteArrayElements+0x89

 #

 # Failed to write core dump. Core dumps have been disabled. To enable core
 dumping, try ulimit -c unlimited before starting Java again

 #

 # An error report file with more information is saved as:

 # /home/scox/skylr/skylr-analytics/hs_err_pid7708.log

 #

 # If you would like to submit a bug report, please visit:

 #   http://bugreport.sun.com/bugreport/crash.jsp


  Steve


  --
 *From:* andy petrella [andy.petre...@gmail.com]
 *Sent:* Thursday, April 17, 2014 3:21 PM
 *To:* user@spark.apache.org
 *Subject:* Re: Spark 0.9.1 core dumps on Mesos 0.18.0

   No of course, but I was guessing some native libs imported (to
 communicate with Mesos) in the project that... could miserably crash the
 JVM.

  Anyway, so you tell us that using this oracle version, you don't have
 any issues when using spark on mesos 0.18.0, that's interesting 'cause
 AFAIR, my last test (done by night, which means floating and eventual
 memory) I was using this particular version as well.

  Just to make thing clear, Sean, you're using spark 0.9.1 on Mesos 0.18.0
 with Hadoop 2.x (x = 2) without any modification than just specifying
 against which version of hadoop you had run make-distribution?

  Thanks for your help,

  Andy

 On Thu, Apr 17, 2014 at 9:11 PM, Sean Owen so...@cloudera.com wrote:

 I don't know if it's anything you or the project is missing... that's
 just a JDK bug.
 FWIW I am on 1.7.0_51 and have not seen anything like that.

 I don't think it's a protobuf issue -- you don't crash the JVM with
 simple version incompatibilities :)
 --
 Sean Owen | Director, Data Science | London


 On Thu, Apr 17, 2014 at 7:29 PM, Steven Cox s...@renci.org wrote:
   So I tried a fix found on the list...
 
 The issue was due to meos version mismatch as I am using latest
 mesos
  0.17.0, but spark uses 0.13.0.
  Fixed by updating the SparkBuild.scala to latest version.
 
  I changed this line in SparkBuild.scala
  org.apache.mesos % mesos% 0.13.0,
  to
  org.apache.mesos % mesos% 0.18.0,
 
  ...ran make-distribution.sh, repackaged and redeployed the tar.gz to
 HDFS.
 
  It still core dumps like this:
  https://gist.github.com/stevencox/11002498
 
  In this environment:
Ubuntu 13.10
Mesos 0.18.0
Spark 0.9.1
JDK 1.7.0_45
Scala 2.10.1
 
  What am I missing?





Re: ETL for postgres to hadoop

2014-04-08 Thread andy petrella
Hello Manas,

I don't know Sqoop that much, but my best guess is that you're probably
using PostGIS, which has specific structures for Geometry and so on. And if
you need some spatial operators, my gut feeling is that things will be
harder ^^ (but a raw import won't need that...).

So I did a quick check in the Sqoop documentation and it looks like
implementing a connector for this kind of structure should do the trick
(check this: http://sqoop.apache.org/docs/1.99.3/ConnectorDevelopment.html).

In any case, I'd be very interested in this kind of stuff! More than
that, having such an import tool for the Oracle Spatial cartridge would be great
as well :-P.

my2c,


Andy Petrella
Belgium (Liège)

*   *
 Data Engineer in *NextLab http://nextlab.be/ sprl* (owner)
 Engaged Citizen Coder for *WAJUG http://wajug.be/* (co-founder)
 Author of *Learning Play! Framework 2
http://www.packtpub.com/learning-play-framework-2/book*
 Bio: on visify https://www.vizify.com/es/52c3feec2163aa0010001eaa
*   *
Mobile: *+32 495 99 11 04*
Mails:

   - andy.petre...@nextlab.be
   - andy.petre...@gmail.com

*   *
Socials:

   - Twitter: https://twitter.com/#!/noootsab
   - LinkedIn: http://be.linkedin.com/in/andypetrella
   - Blogger: http://ska-la.blogspot.com/
   - GitHub:  https://github.com/andypetrella
   - Masterbranch: https://masterbranch.com/andy.petrella



On Tue, Apr 8, 2014 at 10:00 PM, Manas Kar manas@exactearth.com wrote:

  Hi All,

 I have some spatial data on a Postgres machine. I want to be
 able to move that data to Hadoop and do some geo-processing.

 I tried using Sqoop to move the data to Hadoop, but it complained about the
 position data (which it says it can’t recognize).

 Does anyone have any idea as to how to do it easily?



 Thanks

 Manas



http://www.exactearth.com   Manas Kar  Intermediate
 Intermediate
 Software Developer, Product Development | exactEarth Ltd. 60 Struck
 Ct. Cambridge, Ontario N1R 8L2  office. +1.519.622.4445 ext. 5869 |
 direct: +1.519.620.5869  email. manas@exactearth.com

 web. www.exactearth.com




  This e-mail and any attachment is for authorized use by the intended
 recipient(s) only. It contains proprietary or confidential information and
 is not to be copied, disclosed to, retained or used by, any other party. If
 you are not an intended recipient then please promptly delete this e-mail,
 any attachment and all copies and inform the sender. Thank you.


Re: what does SPARK_EXECUTOR_URI in spark-env.sh do ?

2014-04-03 Thread andy petrella
Indeed, that's how Mesos works, actually. The tarball just has to be
somewhere accessible by the Mesos slaves; that's why it is often put in
HDFS.
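As a sketch of the equivalent from code on Spark/Mesos (the host names and
paths below are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// The URI only has to be readable from every Mesos slave (HDFS, HTTP, S3, ...).
val conf = new SparkConf()
  .setAppName("mesos-app")
  .setMaster("mesos://zk://zk1:2181,zk2:2181/mesos")
  .set("spark.executor.uri", "hdfs://namenode/dist/spark-0.9.1-bin.tar.gz")
val sc = new SparkContext(conf)
```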
Le 3 avr. 2014 18:46, felix cnwe...@gmail.com a écrit :

 So, if I set this parameter, there is no need to copy the spark tarball to
 every mesos slave nodes? am I right?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/what-does-SPARK-EXECUTOR-URI-in-spark-env-sh-do-tp3708p3722.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: ActorNotFound problem for mesos driver

2014-04-02 Thread andy petrella
Heya,

Yep this is a problem in the Mesos scheduler implementation that has been
fixed after 0.9.0 (https://spark-project.atlassian.net/browse/SPARK-1052 =
MesosSchedulerBackend)

So several options, like applying the patch, upgrading to 0.9.1 :-/

Cheers,
Andy


On Wed, Apr 2, 2014 at 5:30 PM, Leon Zhang leonca...@gmail.com wrote:

 Hi, Spark Devs:

 I encounter a problem which shows error message akka.actor.ActorNotFound
 on our mesos mini-cluster.

 mesos : 0.17.0
 spark : spark-0.9.0-incubating

 spark-env.sh:
 #!/usr/bin/env bash

 export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so
 export SPARK_EXECUTOR_URI=hdfs://
 192.168.1.20/tmp/spark-0.9.0-incubating-hadoop_2.0.0-cdh4.6.0-bin.tar.gz
 export MASTER=zk://192.168.1.20:2181/mesos
 export SPARK_JAVA_OPTS=-Dspark.driver.port=17077

 And the logs from each slave looks like:

 14/04/02 15:14:37 INFO MesosExecutorBackend: Using Spark's default log4j
 profile: org/apache/spark/log4j-defaults.properties
 14/04/02 15:14:37 INFO MesosExecutorBackend: Registered with Mesos as
 executor ID 201403301937-335653056-5050-991-1
 14/04/02 15:14:38 INFO Slf4jLogger: Slf4jLogger started
 14/04/02 15:14:38 INFO Remoting: Starting remoting
 14/04/02 15:14:38 INFO Remoting: Remoting started; listening on addresses
 :[akka.tcp://spark@zetyun-cloud3:42218]
 14/04/02 15:14:38 INFO Remoting: Remoting now listens on addresses:
 [akka.tcp://spark@zetyun-cloud3:42218]
 14/04/02 15:14:38 INFO SparkEnv: Connecting to BlockManagerMaster:
 akka.tcp://spark@localhost:17077/user/BlockManagerMaster
 akka.actor.ActorNotFound: Actor not found for:
 ActorSelection[Actor[akka.tcp://spark@localhost
 :17077/]/user/BlockManagerMaster]
 at
 akka.actor.ActorSelection$anonfun$resolveOne$1.apply(ActorSelection.scala:66)
 at
 akka.actor.ActorSelection$anonfun$resolveOne$1.apply(ActorSelection.scala:64)
 at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
 at
 akka.dispatch.BatchingExecutor$Batch$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
 at
 akka.dispatch.BatchingExecutor$Batch$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
 at
 akka.dispatch.BatchingExecutor$Batch$anonfun$run$1.apply(BatchingExecutor.scala:59)
 at
 akka.dispatch.BatchingExecutor$Batch$anonfun$run$1.apply(BatchingExecutor.scala:59)
 at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
 at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
 at
 akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74)
 at akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:110)
 at
 akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73)
 at
 scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
 at
 scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
 at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:269)
 at akka.actor.EmptyLocalActorRef.specialHandle(ActorRef.scala:512)
 at akka.actor.DeadLetterActorRef.specialHandle(ActorRef.scala:545)
 at akka.actor.DeadLetterActorRef.$bang(ActorRef.scala:535)
 at
 akka.remote.RemoteActorRefProvider$RemoteDeadLetterActorRef.$bang(RemoteActorRefProvider.scala:91)
 at akka.actor.ActorRef.tell(ActorRef.scala:125)
 at akka.dispatch.Mailboxes$anon$1$anon$2.enqueue(Mailboxes.scala:44)
 at akka.dispatch.QueueBasedMessageQueue$class.cleanUp(Mailbox.scala:438)
 at
 akka.dispatch.UnboundedDequeBasedMailbox$MessageQueue.cleanUp(Mailbox.scala:650)
 at akka.dispatch.Mailbox.cleanUp(Mailbox.scala:309)
 at akka.dispatch.MessageDispatcher.unregister(AbstractDispatcher.scala:204)
 at akka.dispatch.MessageDispatcher.detach(AbstractDispatcher.scala:140)
 at
 akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$finishTerminate(FaultHandling.scala:203)
 at
 akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:163)
 at akka.actor.ActorCell.terminate(ActorCell.scala:338)
 at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:431)
 at akka.actor.ActorCell.systemInvoke(ActorCell.scala:447)
 at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:262)
 at akka.dispatch.Mailbox.run(Mailbox.scala:218)
 at
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 Exception in thread Thread-0

 Any clue for this problem?

 Thanks in advance.



Re: ActorNotFound problem for mesos driver

2014-04-02 Thread andy petrella
np ;-)


On Wed, Apr 2, 2014 at 5:50 PM, Leon Zhang leonca...@gmail.com wrote:

 Aha, thank you for your kind reply.

 Upgrading to 0.9.1 is a good choice. :)


 On Wed, Apr 2, 2014 at 11:35 PM, andy petrella andy.petre...@gmail.comwrote:

 Heya,

 Yep this is a problem in the Mesos scheduler implementation that has been
 fixed after 0.9.0 (https://spark-project.atlassian.net/browse/SPARK-1052= 
 MesosSchedulerBackend)

 So several options, like applying the patch, upgrading to 0.9.1 :-/

 Cheers,
 Andy


 On Wed, Apr 2, 2014 at 5:30 PM, Leon Zhang leonca...@gmail.com wrote:

 Hi, Spark Devs:

 I encounter a problem which shows error message
 akka.actor.ActorNotFound on our mesos mini-cluster.

 mesos : 0.17.0
 spark : spark-0.9.0-incubating

 spark-env.sh:
 #!/usr/bin/env bash

 export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so
 export SPARK_EXECUTOR_URI=hdfs://
 192.168.1.20/tmp/spark-0.9.0-incubating-hadoop_2.0.0-cdh4.6.0-bin.tar.gz
 export MASTER=zk://192.168.1.20:2181/mesos
 export SPARK_JAVA_OPTS=-Dspark.driver.port=17077

 And the logs from each slave looks like:

 14/04/02 15:14:37 INFO MesosExecutorBackend: Using Spark's default log4j
 profile: org/apache/spark/log4j-defaults.properties
 14/04/02 15:14:37 INFO MesosExecutorBackend: Registered with Mesos as
 executor ID 201403301937-335653056-5050-991-1
 14/04/02 15:14:38 INFO Slf4jLogger: Slf4jLogger started
 14/04/02 15:14:38 INFO Remoting: Starting remoting
 14/04/02 15:14:38 INFO Remoting: Remoting started; listening on
 addresses :[akka.tcp://spark@zetyun-cloud3:42218]
 14/04/02 15:14:38 INFO Remoting: Remoting now listens on addresses:
 [akka.tcp://spark@zetyun-cloud3:42218]
 14/04/02 15:14:38 INFO SparkEnv: Connecting to BlockManagerMaster:
 akka.tcp://spark@localhost:17077/user/BlockManagerMaster
 akka.actor.ActorNotFound: Actor not found for:
 ActorSelection[Actor[akka.tcp://spark@localhost
 :17077/]/user/BlockManagerMaster]
 at
 akka.actor.ActorSelection$anonfun$resolveOne$1.apply(ActorSelection.scala:66)
 at
 akka.actor.ActorSelection$anonfun$resolveOne$1.apply(ActorSelection.scala:64)
 at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
 at
 akka.dispatch.BatchingExecutor$Batch$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
 at
 akka.dispatch.BatchingExecutor$Batch$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
 at
 akka.dispatch.BatchingExecutor$Batch$anonfun$run$1.apply(BatchingExecutor.scala:59)
 at
 akka.dispatch.BatchingExecutor$Batch$anonfun$run$1.apply(BatchingExecutor.scala:59)
 at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
 at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
 at
 akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74)
 at
 akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:110)
 at
 akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73)
 at
 scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
 at
 scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
 at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:269)
 at akka.actor.EmptyLocalActorRef.specialHandle(ActorRef.scala:512)
 at akka.actor.DeadLetterActorRef.specialHandle(ActorRef.scala:545)
 at akka.actor.DeadLetterActorRef.$bang(ActorRef.scala:535)
 at
 akka.remote.RemoteActorRefProvider$RemoteDeadLetterActorRef.$bang(RemoteActorRefProvider.scala:91)
 at akka.actor.ActorRef.tell(ActorRef.scala:125)
 at akka.dispatch.Mailboxes$anon$1$anon$2.enqueue(Mailboxes.scala:44)
 at akka.dispatch.QueueBasedMessageQueue$class.cleanUp(Mailbox.scala:438)
 at
 akka.dispatch.UnboundedDequeBasedMailbox$MessageQueue.cleanUp(Mailbox.scala:650)
 at akka.dispatch.Mailbox.cleanUp(Mailbox.scala:309)
 at
 akka.dispatch.MessageDispatcher.unregister(AbstractDispatcher.scala:204)
 at akka.dispatch.MessageDispatcher.detach(AbstractDispatcher.scala:140)
 at
 akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$finishTerminate(FaultHandling.scala:203)
 at
 akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:163)
 at akka.actor.ActorCell.terminate(ActorCell.scala:338)
 at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:431)
 at akka.actor.ActorCell.systemInvoke(ActorCell.scala:447)
 at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:262)
 at akka.dispatch.Mailbox.run(Mailbox.scala:218)
 at
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 Exception in thread Thread-0

 Any clue for this problem?

 Thanks in advance.






Re: Need suggestions

2014-04-02 Thread andy petrella
TL;DR
Your classes are missing on the workers; pass the jar containing the
class main.scala.Utils to the SparkContext.

Longer:
I'm missing some information, like how the SparkContext is configured, but my
best guess is that you didn't provide the jars (addJars on SparkConf, or
use the SC's constructor param).

Actually, the classes are not found on the slave, which is another node,
another machine (or env), and so on. So to run your class it must be able
to load it -- which is handled by Spark, but you simply have to pass the jar as
an argument...

This jar could simply be your current project packaged as a jar using
maven/sbt/...

HTH
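For example, with the 0.9 API this could look like the following (the jar path
is a placeholder for your assembled artifact):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Ship the application jar so executors can load classes such as main.scala.Utils.
val conf = new SparkConf()
  .setMaster("spark://spark-master-001:7077")
  .setAppName("Simple Project")
  .setJars(Seq("/absolute/path/to/target/scala-2.10/simple-project_2.10-1.0.jar"))
val sc = new SparkContext(conf)
```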


On Wed, Apr 2, 2014 at 10:01 PM, yh18190 yh18...@gmail.com wrote:

 Hi Guys,

 Currently I am facing this issue... not able to find the errors.
 Here is the sbt file.
 name := "Simple Project"

 version := "1.0"

 scalaVersion := "2.10.3"

 resolvers += "bintray/meetup" at "http://dl.bintray.com/meetup/maven"

 resolvers += "Akka Repository" at "http://repo.akka.io/releases/"

 resolvers += "Cloudera Repository" at "https://repository.cloudera.com/artifactory/cloudera-repos/"

 libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-incubating"

 libraryDependencies += "com.cloudphysics" % "jerkson_2.10" % "0.6.3"

 libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.0.0-mr1-cdh4.6.0"

 retrieveManaged := true

 output..

 [error] (run-main) org.apache.spark.SparkException: Job aborted: Task 2.0:2
 failed 4 times (most recent failure: Exception failure:
 java.lang.NoClassDefFoundError: Could not initialize class
 main.scala.Utils$)
 org.apache.spark.SparkException: Job aborted: Task 2.0:2 failed 4 times
 (most recent failure: Exception failure: java.lang.NoClassDefFoundError:
 Could not initialize class main.scala.Utils$)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
 at

 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
 at scala.Option.foreach(Option.scala:236)
 at

 org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at

 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at

 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at

 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Need-suggestions-tp3650.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Need suggestions

2014-04-02 Thread andy petrella
Sorry, I was perhaps not clear; anyway, could you try with the path in the
*List* being the absolute one? e.g.
List("/home/yh/src/pj/spark-stuffs/target/scala-2.10/simple-project_2.10-1.0.jar")

In order to provide a relative path, you first need to figure out your CWD,
so (to be really sure) you can do:
// before the new SparkContext
println(new java.io.File(".").toURI.toString) // didn't try, so adapt to
make scalac happy ^^




On Wed, Apr 2, 2014 at 11:30 PM, yh18190 yh18...@gmail.com wrote:

 Hi,
 Here is the SparkContext setup. Do I need to ship any more extra jars to the
 slaves separately, or is this enough?
 I am able to see the created jar in my target directory:

  val sc = new SparkContext("spark://spark-master-001:7077", "Simple App",
    utilclass.spark_home,
    List("target/scala-2.10/simple-project_2.10-1.0.jar"))




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Need-suggestions-tp3650p3655.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Announcing Spark SQL

2014-03-27 Thread andy petrella
nope (what I said :-P)


On Thu, Mar 27, 2014 at 11:05 AM, Pascal Voitot Dev 
pascal.voitot@gmail.com wrote:




 On Thu, Mar 27, 2014 at 10:22 AM, andy petrella 
 andy.petre...@gmail.comwrote:

 I just mean queries sent at runtime ^^, like for any RDBMS.
 In our project we have such requirement to have a layer to play with the
 data (custom and low level service layer of a lambda arch), and something
 like this is interesting.


 Ok that's what I thought! But for these runtime queries, is a macro useful
 for you?




 On Thu, Mar 27, 2014 at 10:15 AM, Pascal Voitot Dev 
 pascal.voitot@gmail.com wrote:


 Le 27 mars 2014 09:47, andy petrella andy.petre...@gmail.com a
 écrit :

 
  I hijack the thread, but my2c is that this feature is also important
 to enable ad-hoc queries which is done at runtime. It doesn't remove
 interests for such macro for precompiled jobs of course, but it may not be
 the first use case envisioned with this Spark SQL.
 

 I'm not sure to see what you call ad- hoc queries... Any sample?

  Again, only my0.2c (ok I divided by 10 after writing my thoughts ^^)
 
  Andy
 
  On Thu, Mar 27, 2014 at 9:16 AM, Pascal Voitot Dev 
 pascal.voitot@gmail.com wrote:
 
  Hi,
  Quite interesting!
 
  Suggestion: why not go even fancier  parse SQL queries at
 compile-time with a macro ? ;)
 
  Pascal
 
 
 
  On Wed, Mar 26, 2014 at 10:58 PM, Michael Armbrust 
 mich...@databricks.com wrote:
 
  Hey Everyone,
 
  This already went out to the dev list, but I wanted to put a pointer
 here as well to a new feature we are pretty excited about for Spark 1.0.
 
 
 http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html
 
  Michael
 
 
 






Re: Announcing Spark SQL

2014-03-27 Thread andy petrella
Yes it could, of course. I didn't say that there is no tool to do it,
though ;-).

Andy


On Thu, Mar 27, 2014 at 12:49 PM, yana yana.kadiy...@gmail.com wrote:

 Does Shark not suit your needs? That's what we use at the moment and it's
 been good


 Sent from my Samsung Galaxy S®4


  Original message 
 From: andy petrella
 Date:03/27/2014 6:08 AM (GMT-05:00)
 To: user@spark.apache.org
 Subject: Re: Announcing Spark SQL

 nope (what I said :-P)








Re: [re-cont] map and flatMap

2014-03-15 Thread andy petrella
Yep,
Regarding flatMap, an implicit parameter might work like it does in Scala's
Future, for instance:
https://github.com/scala/scala/blob/master/src/library/scala/concurrent/Future.scala#L246

Dunno, still waiting for some insights from the team ^^
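
Something along these lines maybe, purely as a sketch (the names are
invented, this is not an actual API, just to show the shape):

  import scala.reflect.ClassTag
  import org.apache.spark.rdd.RDD

  // invented: evidence deciding how the produced RDDs get merged back
  // (union then coalesce, cogroup, ...), playing the role ExecutionContext
  // plays for Future.flatMap
  trait CombineStrategy {
    def combine[U: ClassTag](rdds: Seq[RDD[U]]): RDD[U]
  }

  trait MonadicRDD[T] {
    // compare scala.concurrent.Future:
    //   def flatMap[S](f: T => Future[S])(implicit executor: ExecutionContext): Future[S]
    def flatMap[U](f: T => RDD[U])(implicit ct: ClassTag[U], strategy: CombineStrategy): RDD[U]
  }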

andy

On Wed, Mar 12, 2014 at 3:23 PM, Pascal Voitot Dev 
pascal.voitot@gmail.com wrote:

 On Wed, Mar 12, 2014 at 3:06 PM, andy petrella andy.petre...@gmail.com
 wrote:

  Folks,
 
  I just want to point something out...
  I haven't had time yet to sort it out and to think it through enough to
  give a valuable, strict explanation -- even though, intuitively, I feel
  there is a lot to it => I need the Spark people, or more time, to move forward.
  But here is the thing regarding *flatMap*.
 
  Actually, it looks like (and again, it intuitively makes sense) RDDs (and
  of course DStreams) aren't monadic, and this is reflected in the
  implementation (and signature) of flatMap:
 
  
    def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] =
      new FlatMappedRDD(this, sc.clean(f))
 
 
  There!? flatMap (or bind, >>=) should take a function that uses the same
  higher-level abstraction in order to be considered as such, right?
 
 
 I had remarked exactly the same thing and asked myself the same question...

  In this case, it takes a function that returns a TraversableOnce of the
  RDD's content type, and what represents the output is more the content of
  the RDD than the RDD itself (still right?).
 
  This actually breaks the usual understanding of map and flatMap:
 
    def map[U: ClassTag](f: T => U): RDD[U] = new MappedRDD(this, sc.clean(f))
 
 
  Indeed, RDD is a functor, and the underlying reason for flatMap not to
  take an A => RDD[B] doesn't show up in map.
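
  To make that concrete, a functor-ish instance is easy to write -- modulo
  the ClassTag that the standard Functor signature doesn't carry; the
  typeclass below is a made-up sketch, not scalaz's:

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    // a Functor variant that tolerates the ClassTag RDD.map requires
    trait CtFunctor[F[_]] {
      def map[A, B: ClassTag](fa: F[A])(f: A => B): F[B]
    }

    object RDDIsAFunctor {
      implicit val rddCtFunctor: CtFunctor[RDD] = new CtFunctor[RDD] {
        def map[A, B: ClassTag](fa: RDD[A])(f: A => B): RDD[B] = fa.map(f)
      }
    }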
 
  This has a lot of consequences actually, because at first one might want
  to create for-comprehensions over RDDs, or even Traversable[F[_]] functions
  like sequence -- and he will get stuck since the signatures aren't
  compliant.
  More importantly, Scala uses a convention on the structure of a type to
  allow for-comprehensions... so where Traversable[F[_]] will fail on types,
  a for-comprehension will fail weirdly.
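
  Here is a tiny, self-contained illustration of that weird failure (local
  master, made-up data):

    import org.apache.spark.{SparkConf, SparkContext}

    object ForCompOverRDDs {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("for-comp").setMaster("local"))
        val numbers = sc.parallelize(Seq(1, 2, 3))
        val lines   = sc.parallelize(Seq("a b", "c"))

        // This desugars to numbers.flatMap(x => lines.map(y => (x, y))), but
        // flatMap wants Int => TraversableOnce[(Int, String)] and an RDD is
        // not a TraversableOnce, so it does not type-check:
        //
        //   for { x <- numbers; y <- lines } yield (x, y)

        // flatMap over the *content* is what compiles:
        val words = lines.flatMap(_.split(" "))  // String => TraversableOnce[String]

        // and a cross product has to be requested explicitly (and it is costly):
        val pairs = numbers.cartesian(lines)

        println(words.collect().toList)
        println(pairs.count())
        sc.stop()
      }
    }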
 

 +1


 
  Again, this signature sounds normal, because my intuitive feeling about
  RDDs is that they can *only* be monadic, but the composition would depend
  on the use case and might have heavy consequences (unioning the RDDs, for
  instance => this happening behind the scenes can be a big pain, since it
  wouldn't be efficient at all).
 
  So Yes, RDD could be monadic but with care.
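
  By "with care", I mean something like the sketch below: a bind that is
  expressible, but that has to go through the driver and a pile of unions --
  exactly the kind of heavy consequence I'm worried about. Invented code,
  not a recommendation:

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    object CarefulBind {
      // f has to run on the driver (RDDs can't be nested inside tasks), and
      // the results are glued back together with union: a collect, then lots
      // of small partitions, and no clever planning at all.
      def bind[A, B: ClassTag](fa: RDD[A])(f: A => RDD[B]): RDD[B] = {
        val pieces = fa.collect().toSeq.map(f)
        if (pieces.isEmpty) fa.sparkContext.parallelize(Seq.empty[B])
        else pieces.reduce(_ union _)
      }
    }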
 

 At least we can say, it is a Functor...
 Actually, I had imagined studying the monadic aspect of RDDs but as you
 said, it's not so easy...
 So for now, I consider them as pseudo-monadic ;)



  So what this signature exposes is a way to flatMap over the inner values,
  much like flatMapValues does for map-like (pair) RDDs.
 
  So, wouldn't it be better to rename flatMap to flatMapData (or whatever
  better name)? Or to have flatMap require a Monad instance for RDD?
 
 
 renaming it to flatMapData or flatTraversableMap sounds good to me (even if
 lots of people will hate it...)
 flatMap requiring a Monad would certainly make it impossible to use in
 for-comprehensions, no?


  Sorry for the prose, just dropped my thoughts and feelings at once :-/
 
 
 I agree with you, in case it helps not to feel alone ;)

 Pascal

 Cheers,
  andy
 
  PS: and sorry for my English too, maybe; although my name's Andy, I'm a
  native Belgian ^^.