Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Deepak Sharma
+1.
I can contribute to it as well.

On Tue, 19 Mar 2024 at 9:19 AM, Code Tutelage 
wrote:

> +1
>
> Thanks for proposing
>
> On Mon, Mar 18, 2024 at 9:25 AM Parsian, Mahmoud
>  wrote:
>
>> Good idea. Will be useful
>>
>>
>>
>> +1
>>
>>
>>
>>
>>
>>
>>
>> *From: *ashok34...@yahoo.com.INVALID 
>> *Date: *Monday, March 18, 2024 at 6:36 AM
>> *To: *user @spark , Spark dev list <
>> dev@spark.apache.org>, Mich Talebzadeh 
>> *Cc: *Matei Zaharia 
>> *Subject: *Re: A proposal for creating a Knowledge Sharing Hub for
>> Apache Spark Community
>>
>> External message, be mindful when clicking links or attachments
>>
>>
>>
>> Good idea. Will be useful
>>
>>
>>
>> +1
>>
>>
>>
>> On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>
>>
>>
>>
>> Some of you may be aware that the Databricks community
>>
>> has just launched a knowledge sharing hub. I thought it would be a
>>
>> good idea for the Apache Spark user group to have the same, especially
>>
>> for repeat questions on Spark core, Spark SQL, Spark Structured
>>
>> Streaming, Spark MLlib and so forth.
>>
>>
>>
>> Apache Spark user and dev groups have been around for a good while.
>>
>> They are serving their purpose. We went through creating a Slack
>>
>> community that managed to create more heat than light. This is
>>
>> what the Databricks community came up with, and I quote:
>>
>>
>>
>> "Knowledge Sharing Hub
>>
>> Dive into a collaborative space where members like YOU can exchange
>>
>> knowledge, tips, and best practices. Join the conversation today and
>>
>> unlock a wealth of collective wisdom to enhance your experience and
>>
>> drive success."
>>
>>
>>
>> I don't know the logistics of setting it up, but I am sure that should
>>
>> not be that difficult. If anyone is supportive of this proposal, let
>>
>> the usual +1, 0, -1 decide.
>>
>>
>>
>> HTH
>>
>>
>>
>> Mich Talebzadeh,
>>
>> Dad | Technologist | Solutions Architect | Engineer
>>
>> London
>>
>> United Kingdom
>>
>>
>>
>>
>>
>>   view my Linkedin profile
>>
>>
>>
>>
>>
>> https://en.everybodywiki.com/Mich_Talebzadeh
>> 
>>
>>
>>
>>
>>
>>
>>
>> Disclaimer: The information provided is correct to the best of my
>>
>> knowledge but of course cannot be guaranteed. It is essential to note
>>
>> that, as with any advice, "one test result is worth one-thousand
>>
>> expert opinions" (Wernher von Braun).
>>
>>
>>
>> -
>>
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>>
>


Re: Data Contracts

2023-06-19 Thread Deepak Sharma
Sorry for using "simple" in my last email.
It's not going to be simple in any terms.
Thanks for sharing the Git repo, Phillip.
Will definitely go through it.

Thanks
Deepak

On Mon, 19 Jun 2023 at 3:47 PM, Phillip Henry 
wrote:

> I think it might be a bit more complicated than this (but happy to be
> proved wrong).
>
> I have a minimum working example at:
>
> https://github.com/PhillHenry/SparkConstraints.git
>
> that runs out-of-the-box (mvn test) and demonstrates what I am trying to
> achieve.
>
> A test persists a DataFrame that conforms to the contract and demonstrates
> that one that does not, throws an Exception.
>
> I've had to slightly modify 3 Spark files to add the data contract
> functionality. If you can think of a more elegant solution, I'd be very
> grateful.
>
> Regards,
>
> Phillip
>
>
>
>
> On Mon, Jun 19, 2023 at 9:37 AM Deepak Sharma 
> wrote:
>
>> It can be as simple as adding a function to the Spark session builder,
>> specifically on the read, which can take the YAML file (the data contract
>> definition would be in YAML) and apply it to the DataFrame.
>> It can ignore the rows not matching the data contracts defined in the
>> YAML.
>>
>> Thanks
>> Deepak
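
A minimal sketch of the idea above, assuming the YAML contract has already been parsed into per-column rules; the ColumnRule class, rule fields, and paths are hypothetical, not an existing Spark API:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

// Hypothetical shape of one rule taken from a parsed YAML contract.
case class ColumnRule(column: String,
                      min: Option[Long] = None,
                      max: Option[Long] = None,
                      notNull: Boolean = false)

object ContractReader {
  // Keep only the rows that satisfy every rule; violating rows are dropped,
  // matching the "ignore rows not matching the contract" idea above.
  def applyContract(df: DataFrame, rules: Seq[ColumnRule]): DataFrame =
    rules.foldLeft(df) { (acc, rule) =>
      var keep = lit(true)
      if (rule.notNull) keep = keep && col(rule.column).isNotNull
      rule.min.foreach(m => keep = keep && col(rule.column) >= m)
      rule.max.foreach(m => keep = keep && col(rule.column) <= m)
      acc.filter(keep)
    }
}

// Usage (paths and rule values are placeholders):
// val spark = SparkSession.builder.getOrCreate()
// val df    = spark.read.parquet("/data/in")
// val clean = ContractReader.applyContract(df,
//   Seq(ColumnRule("age", min = Some(0), max = Some(120), notNull = true)))
```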
>>
>> On Mon, 19 Jun 2023 at 1:49 PM, Phillip Henry 
>> wrote:
>>
>>> For my part, I'm not too concerned about the mechanism used to implement
>>> the validation as long as it's rich enough to express the constraints.
>>>
>>> I took a look at JSON Schemas (for which there are a number of JVM
>>> implementations) but I don't think it can handle more complex data types
>>> like dates. Maybe Elliot can comment on this?
>>>
>>> Ideally, *any* reasonable mechanism could be plugged in.
>>>
>>> But what struck me from trying to write a Proof of Concept was that it
>>> was quite hard to inject my code into this particular area of the Spark
>>> machinery. It could very well be due to my limited understanding of the
>>> codebase, but it seemed the Spark code would need a bit of a refactor
>>> before a component could be injected. Maybe people in this forum with
>>> greater knowledge in this area could comment?
>>>
>>> BTW, it's interesting to see that Databricks' "Delta Live Tables" appear
>>> to be attempting to implement data contracts within their ecosystem.
>>> Unfortunately, I think it's closed source and Python only.
>>>
>>> Regards,
>>>
>>> Phillip
>>>
>>> On Sat, Jun 17, 2023 at 11:06 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> It would be interesting to think about creating a contract
>>>> validation library written in JSON format. This would provide a validation
>>>> mechanism that relies on this library and could be shared among relevant
>>>> parties. Would that be a starting point?
>>>>
>>>> HTH
>>>>
>>>> Mich Talebzadeh,
>>>> Lead Solutions Architect/Engineering Lead
>>>> Palantir Technologies Limited
>>>> London
>>>> United Kingdom
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, 14 Jun 2023 at 11:13, Jean-Georges Perrin  wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> While I was at PayPal, we open sourced a Data Contract template; it
>>>>> is here: https://github.com/paypal/data-contract-template. Companies
>>>>> like GX (Great Expectations) are interested in using it.
>>>>>
>>>>> Spark could read some elements from it pretty easily, like schema
>>>>> validation and some rule validations. Spark could also generate an embryo of
>>>>> data contracts…
>>>>>
>>>>> —jgp
>>>>>
>>>>>
>>>>> On Jun 13, 2

Re: Data Contracts

2023-06-19 Thread Deepak Sharma
ich_Talebzadeh
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Tue, 13 Jun 2023 at 10:01, Phillip Henry 
>>> wrote:
>>>
>>>> Hi, Fokko and Deepak.
>>>>
>>>> The problem with DBT and Great Expectations (and Soda too, I believe)
>>>> is that by the time they find the problem, the error is already in
>>>> production - and fixing production can be a nightmare.
>>>>
>>>> What's more, we've found that nobody ever looks at the data quality
>>>> reports we already generate.
>>>>
>>>> You can, of course, run dbt, GE, etc. as part of a CI/CD pipeline, but
>>>> it's usually against synthetic or at best sampled data (laws like GDPR
>>>> generally stop personal data being anywhere but prod).
>>>>
>>>> What I'm proposing is something that stops production data ever being
>>>> tainted.
>>>>
>>>> Hi, Elliot.
>>>>
>>>> Nice to see you again (we worked together 20 years ago)!
>>>>
>>>> The problem here is that a schema itself won't protect me (at least as
>>>> I understand your argument). For instance, I have medical records that say
>>>> some of my patients are 999 years old which is clearly ridiculous but their
>>>> age correctly conforms to an integer data type. I have other patients who
>>>> were discharged *before* they were admitted to hospital. I have 28
>>>> patients out of literally millions who recently attended hospital but were
>>>> discharged on 1/1/1900. As you can imagine, this made the average length of
>>>> stay (a key metric for acute hospitals) much lower than it should have
>>>> been. It only came to light when some average length of stays were
>>>> negative!
>>>>
>>>> In all these cases, the data faithfully adhered to the schema.
>>>>
>>>> Hi, Ryan.
>>>>
>>>> This is an interesting point. There *should* indeed be a human
>>>> connection but often there isn't. For instance, I have a friend who
>>>> complained that his company's Zurich office made a breaking change and was
>>>> not even aware that his London based department existed, never mind
>>>> depended on their data. In large organisations, this is pretty common.
>>>>
>>>> TBH, my proposal doesn't address this particular use case (maybe hooks
>>>> and metastore listeners would...?) But my point remains that although these
>>>> relationships should exist, in a sufficiently large organisation, they
>>>> generally don't. And maybe we can help fix that with code?
>>>>
>>>> Would love to hear further thoughts.
>>>>
>>>> Regards,
>>>>
>>>> Phillip
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Jun 13, 2023 at 8:17 AM Fokko Driesprong 
>>>> wrote:
>>>>
>>>>> Hey Phillip,
>>>>>
>>>>> Thanks for raising this. I like the idea. The question is, should this
>>>>> be implemented in Spark or some other framework? I know that dbt has a 
>>>>> fairly
>>>>> extensive way of testing your data
>>>>> <https://www.getdbt.com/product/data-testing/>, and making sure that
>>>>> you can enforce assumptions on the columns. The nice thing about dbt is
>>>>> that it is built from a software engineering perspective, so all the tests
>>>>> (or contracts) are living in version control. Using pull requests you 
>>>>> could
>>>>> collaborate on changing the contract and making sure that the change has
>>>>> gotten enough attention before pushing it to production. Hope this helps!
>>>>>
>>>>> Kind regards,
>>>>> Fokko
>>>>>
>>>>> Op di 13 jun 2023 om 04:31 schreef Deepak Sharma <
>>>>> deepakmc...@gmail.com>:
>>>>>
>>>>>> Spark can be used with tools like great expectations as well to
>>>>>> implem

Re: Data Contracts

2023-06-12 Thread Deepak Sharma
Spark can be used with tools like Great Expectations as well to implement
data contracts.
I am not sure, though, if Spark alone can do data contracts.
I was reading a blog on data mesh and how to glue it together with data
contracts; that's where I came across this mention of Spark and Great
Expectations.

HTH

-Deepak

On Tue, 13 Jun 2023 at 12:48 AM, Elliot West  wrote:

> Hi Phillip,
>
> While not as fine-grained as your example, there do exist schema systems
> such as that in Avro that can evaluate compatible and incompatible
> changes to the schema, from the perspective of the reader, writer, or both.
> This provides some potential degree of enforcement, and a means to
> communicate a contract. Interestingly, I believe this approach has been
> applied to both JSON Schema and Protobuf as part of the Confluent Schema
> Registry.
>
> Elliot.
>
> On Mon, 12 Jun 2023 at 12:43, Phillip Henry 
> wrote:
>
>> Hi, folks.
>>
>> There currently seems to be a buzz around "data contracts". From what I
>> can tell, these mainly advocate a cultural solution. But instead, could big
>> data tools be used to enforce these contracts?
>>
>> My questions really are: are there any plans to implement data
>> constraints in Spark (e.g., an integer must be between 0 and 100; the date in
>> column X must be before that in column Y)? And if not, is there an appetite
>> for them?
>>
>> Maybe we could associate constraints with schema metadata that are
>> enforced in the implementation of a FileFormatDataWriter?
>>
>> Just throwing it out there and wondering what other people think. It's an
>> area that interests me as it seems that over half my problems at the day
>> job are because of dodgy data.
>>
>> Regards,
>>
>> Phillip
>>
>>
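
A rough sketch of what such checks can look like today as an explicit pre-write step, rather than the schema-metadata/FileFormatDataWriter enforcement proposed above; the ConstraintCheck object, column names, and paths are illustrative only:

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

// Fails the job up front if any row violates a named constraint,
// so tainted data never reaches the output path.
object ConstraintCheck {
  def validate(df: DataFrame, constraints: Map[String, Column]): Unit =
    constraints.foreach { case (name, predicate) =>
      val violations = df.filter(!predicate).count()
      if (violations > 0)
        throw new IllegalStateException(
          s"Constraint '$name' violated by $violations row(s)")
    }
}

// Usage, mirroring the examples above:
// ConstraintCheck.validate(df, Map(
//   "score in [0, 100]"          -> (col("score") >= 0 && col("score") <= 100),
//   "admitted before discharged" -> (col("admitted") < col("discharged"))))
// df.write.parquet("/data/out")
```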


Re: Spark on Kube (virtual) coffee/tea/pop times

2023-02-07 Thread Deepak Sharma
Please count me in.
Can we have Spark on k8s with the Spark Connect feature covered?

On Wed, 8 Feb 2023 at 10:03, Kirti Ruge  wrote:

> Greetings everyone,
> I would love to be part of this session.
> IST
>
>
> On Wed, 8 Feb 2023 at 9:13 AM, Colin Williams <
> colin.williams.seat...@gmail.com> wrote:
>
>> I wouldn't mind attending or viewing a recording depending on
>> availability. I'm interested in challenges and solutions to porting Spark
>> jobs between environments.
>>
>> On Tue, Feb 7, 2023 at 7:34 PM Denis Bolshakov 
>> wrote:
>>
>>> Hello,
>>>
>>> I am also interested, please add me to the conf.
>>>
>>> ср, 8 февр. 2023 г., 07:21 Jayabindu Singh :
>>>
 Greetings everyone!
 I am super new to this group and currently leading some work to deploy
 spark on k8s for my company, o9 Solutions.
 I would love to join the discussion.
 I am in PST.

 Regards
 Jay

 Sent from my iPhone

 On Feb 7, 2023, at 3:57 PM, Mich Talebzadeh 
 wrote:

 
 Could be interesting. Need to summarise where we are with Spark on k8s
 and what the market demands.

 My personal experience with Volcano was not that impressive. So maybe
 a summary will do of where we are currently with Spark on k8s.

 I am on Greenwich Mean Time but I can take part in late sessions if
 needed.


 HTH


view my Linkedin profile
 


  https://en.everybodywiki.com/Mich_Talebzadeh



 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.




 On Tue, 7 Feb 2023 at 23:37, John Zhuge  wrote:

> Awesome, count me in!
> PST
>
> On Tue, Feb 7, 2023 at 3:34 PM Andrew Melo 
> wrote:
>
>> I'm Central US time (AKA UTC -6:00)
>>
>> On Tue, Feb 7, 2023 at 5:32 PM Holden Karau 
>> wrote:
>> >
>> > Awesome, I guess I should have asked folks for timezones that
>> they’re in.
>> >
>> > On Tue, Feb 7, 2023 at 3:30 PM Andrew Melo 
>> wrote:
>> >>
>> >> Hello Holden,
>> >>
>> >> We are interested in Spark on k8s and would like the opportunity to
>> >> speak with devs about what we're looking for slash better ways to
>> use
>> >> spark.
>> >>
>> >> Thanks!
>> >> Andrew
>> >>
>> >> On Tue, Feb 7, 2023 at 5:24 PM Holden Karau 
>> wrote:
>> >> >
>> >> > Hi Folks,
>> >> >
>> >> > It seems like we could maybe use some additional shared context
>> around Spark on Kube so I’d like to try and schedule a virtual coffee
>> session.
>> >> >
>> >> > Who all would be interested in virtual adventures around Spark
>> on Kube development?
>> >> >
>> >> > No pressure if the idea of hanging out in a virtual chat with
>> coffee and Spark devs does not sound like your thing, just trying to make
>> something informal so we can have a better understanding of everyone’s
>> goals here.
>> >> >
>> >> > Cheers,
>> >> >
>> >> > Holden :)
>> >> > --
>> >> > Twitter: https://twitter.com/holdenkarau
>> >> > Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> >> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>> >
>> > --
>> > Twitter: https://twitter.com/holdenkarau
>> > Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>> --
> John Zhuge
>



Re: Spark Issue with Istio in Distributed Mode

2022-09-12 Thread Deepak Sharma
Was able to resolve the issue of idle connections being terminated by using an
EnvoyFilter.

On Sat, 3 Sept 2022 at 18:14, Ilan Filonenko  wrote:

> Must be set in envoy (maybe could passthrough via istio)
>
> https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/core/v3/protocol.proto#envoy-v3-api-field-config-core-v3-httpprotocoloptions-idle-timeout
>
>
> On Sat, Sep 3, 2022 at 4:23 AM Deepak Sharma 
> wrote:
>
>> Thanks for the reply, Ilan.
>> Can we set this in the Spark conf, or does it need to go into the Istio / Envoy conf?
>>
>>
>>
>> On Sat, 3 Sept 2022 at 10:28, Ilan Filonenko  wrote:
>>
>>> This might be a result of the idle_timeout that is configured in envoy.
>>> The default is an hour.
>>>
>>> On Sat, Sep 3, 2022 at 12:17 AM Deepak Sharma 
>>> wrote:
>>>
>>>> Hi All,
>>>> In 1 of our cluster , we enabled Istio where spark is running in
>>>> distributed mode.
>>>> Spark works fine when we run it with Istio in standalone mode.
>>>> In spark distributed mode , we are seeing that every 1 hour or so the
>>>> workers are getting disassociated from master and then master is not able
>>>> to spawn any jobs on these workers , until we restart spark rest server.
>>>>
>>>> Here is the error we see in the worker logs:
>>>>
>>>>
>>>> *ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to :
>>>> Driver spark-rest-service:44463 disassociated! Shutting down.*
>>>>
>>>> For 1 hour or so (until this issue happens) , spark distributed mode
>>>> works just fine.
>>>>
>>>>
>>>> Thanks
>>>> Deepak
>>>>
>>>


Re: Spark Issue with Istio in Distributed Mode

2022-09-03 Thread Deepak Sharma
Thanks for the reply, Ilan.
Can we set this in the Spark conf, or does it need to go into the Istio / Envoy conf?



On Sat, 3 Sept 2022 at 10:28, Ilan Filonenko  wrote:

> This might be a result of the idle_timeout that is configured in envoy.
> The default is an hour.
>
> On Sat, Sep 3, 2022 at 12:17 AM Deepak Sharma 
> wrote:
>
>> Hi All,
>> In 1 of our cluster , we enabled Istio where spark is running in
>> distributed mode.
>> Spark works fine when we run it with Istio in standalone mode.
>> In spark distributed mode , we are seeing that every 1 hour or so the
>> workers are getting disassociated from master and then master is not able
>> to spawn any jobs on these workers , until we restart spark rest server.
>>
>> Here is the error we see in the worker logs:
>>
>>
>> *ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to :
>> Driver spark-rest-service:44463 disassociated! Shutting down.*
>>
>> For 1 hour or so (until this issue happens) , spark distributed mode
>> works just fine.
>>
>>
>> Thanks
>> Deepak
>>
>


Spark Issue with Istio in Distributed Mode

2022-09-02 Thread Deepak Sharma
Hi All,
In one of our clusters, we enabled Istio where Spark is running in
distributed mode.
Spark works fine when we run it with Istio in standalone mode.
In Spark distributed mode, we are seeing that every hour or so the
workers get disassociated from the master, and then the master is not able
to spawn any jobs on these workers until we restart the Spark REST server.

Here is the error we see in the worker logs:


*ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to : Driver
spark-rest-service:44463 disassociated! Shutting down.*

For an hour or so (until this issue happens), Spark distributed mode works
just fine.


Thanks
Deepak


Observability around Flink Pipeline/stateful functions

2021-07-22 Thread Deepak Sharma
@dev@spark.apache.org  @user 
I am looking for an example of an observability framework for Apache
Flink pipelines.
This could be message tracing across multiple Flink pipelines, or querying
the past state of a message that was processed by any Flink pipeline.
If anyone has done similar work and can share any
pointers (blogs/books/write-ups), it would really help.

Thanks a lot in advance.

--Deepak


Write to same hdfs dir from multiple spark jobs

2020-07-29 Thread Deepak Sharma
Hi
Is there any design pattern for writing to the same HDFS directory from
multiple Spark jobs?

-- 
Thanks
Deepak
www.bigdatabig.com
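
One commonly suggested pattern for the question above, sketched here rather than prescribed: give each job its own sub-directory under the shared root, so concurrent jobs never race on the same files or on the root's _temporary directory. The job-id value and paths are placeholders:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

object SharedDirWriter {
  // Each job writes directly into root/job_id=<id>; readers can still load
  // the whole root and see job_id as a discovered partition column.
  def write(df: DataFrame, root: String, jobId: String): Unit =
    df.write
      .mode(SaveMode.Overwrite)
      .parquet(s"$root/job_id=$jobId")
}

// Usage:
// SharedDirWriter.write(resultDf, "hdfs:///data/shared", "job-42")
// spark.read.parquet("hdfs:///data/shared")   // sees every job's output
```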


GroupBy issue while running K-Means - Dataframe

2020-06-16 Thread Deepak Sharma
Hi All,
I have a custom implementation of K-Means that needs the data to be
grouped by a key in a DataFrame.
Now there is a big data skew for some of the keys, where it exceeds the
BufferHolder limit:
 Cannot grow BufferHolder by size 17112 because the size after growing
exceeds size limitation 2147483632

I tried solving it by converting the DataFrame to an RDD, using
reduceByKey on the RDD, and then converting it back.
This gives a Java heap out-of-memory error.
Since it looks like a common issue, I was wondering how others have been
solving this problem?
-- 
Thanks
Deepak
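
One common mitigation for this kind of skew is a two-stage ("salted") aggregation, sketched below with a plain sum; the column names are made up, and a K-Means update would need its own combinable partial aggregate:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

object SkewedGroupBy {
  // Stage 1: spread each hot key across `buckets` random salt values so no
  // single task has to buffer the whole group. Stage 2: combine the partials.
  def saltedSum(df: DataFrame, buckets: Int): DataFrame = {
    val salted = df.withColumn("salt", (rand() * buckets).cast("int"))
    val partial = salted
      .groupBy(col("key"), col("salt"))
      .agg(sum("value").as("partial_sum"), count(lit(1)).as("partial_count"))
    partial
      .groupBy("key")
      .agg(sum("partial_sum").as("total_sum"),
           sum("partial_count").as("total_count"))
  }
}
```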


unsubscribe

2019-12-07 Thread Deepak Sharma



Read hdfs files in spark streaming

2019-06-09 Thread Deepak Sharma
I am using a Spark Streaming application to read from Kafka.
The value coming in the Kafka message is the path to an HDFS file.
I am using Spark 2.x with spark.readStream.
What is the best way to read this path in Spark Streaming and then read the
JSON stored at the HDFS path, maybe using spark.read.json, into a DataFrame
inside the streaming app?
Thanks a lot in advance

-- 
Thanks
Deepak
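
A hedged sketch of one way to do this with foreachBatch (available from Spark 2.4): collect the paths carried by each micro-batch on the driver and read the JSON behind them. The broker, topic, and output path are placeholders:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object PathDrivenStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("path-driven-stream").getOrCreate()
    import spark.implicits._

    // The Kafka message value is assumed to be a plain HDFS path.
    val paths = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder
      .option("subscribe", "hdfs-paths")                 // placeholder topic
      .load()
      .selectExpr("CAST(value AS STRING) AS path")

    // Read the JSON files named in this micro-batch into a DataFrame.
    def processBatch(batch: DataFrame, batchId: Long): Unit = {
      val files = batch.as[String].collect()
      if (files.nonEmpty) {
        val df = spark.read.json(files: _*)
        df.write.mode("append").parquet("/data/out") // or any other processing
      }
    }

    val query = paths.writeStream
      .foreachBatch(processBatch _)
      .start()

    query.awaitTermination()
  }
}
```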


Re: welcoming Burak and Holden as committers

2017-01-24 Thread Deepak Sharma
Congratulations Holden & Burak

On Wed, Jan 25, 2017 at 8:23 AM, jiangxingbo  wrote:

> Congratulations Burak & Holden!
>
> > 在 2017年1月25日,上午2:13,Reynold Xin  写道:
> >
> > Hi all,
> >
> > Burak and Holden have recently been elected as Apache Spark committers.
> >
> > Burak has been very active in a large number of areas in Spark,
> including linear algebra, stats/maths functions in DataFrames, Python/R
> APIs for DataFrames, dstream, and most recently Structured Streaming.
> >
> > Holden has been a long time Spark contributor and evangelist. She has
> written a few books on Spark, as well as frequent contributions to the
> Python API to improve its usability and performance.
> >
> > Please join me in welcoming the two!
> >
> >
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Auto start spark jobs

2016-10-10 Thread Deepak Sharma
Hi All
Is there any way to schedule the ever-running Spark job in such a way that it
comes back up on its own after cluster maintenance?


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: Reading back hdfs files saved as case class

2016-10-07 Thread Deepak Sharma
Thanks for the answer, Reynold.
Yes, I can use the Dataset, but it won't solve the purpose I am supposed to
use it for.
I am trying to work on a solution where I need to save the case class along
with the data in HDFS.
Further, this data will move to different folders corresponding to different
case classes.
The Spark programs reading these files are supposed to apply the case class
directly, depending on the folder they are reading from.

Thanks
Deepak

On Oct 8, 2016 00:53, "Reynold Xin" <r...@databricks.com> wrote:

> You can use the Dataset API -- it should solve this issue for case classes
> that are not very complex.
>
> On Fri, Oct 7, 2016 at 12:20 PM, Deepak Sharma <deepakmc...@gmail.com>
> wrote:
>
>> Hi
>> I am saving RDD[Example] in hdfs from spark program , where Example is
>> case class.
>> Now when i am trying to read it back , it returns RDD[String] with the
>> content as below:
>> *Example(1,name,value)*
>>
>> The workaround can be to write as a string in hdfs and read it back as
>> string and perform further processing.This way the case class name wouldn't
>> appear at all in the file being written in hdfs.
>> But i am keen to know if we can read the data directly in Spark if the
>> RDD[Case_Class] is written to hdfs?
>>
>> --
>> Thanks
>> Deepak
>>
>
>
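
A minimal sketch of the Dataset round trip Reynold suggests, using Parquet so the schema (rather than the class name) is what gets stored; the paths are placeholders:

```scala
import org.apache.spark.sql.SparkSession

// The case class whose instances we want to round-trip through HDFS.
case class Example(id: Int, name: String, value: String)

object CaseClassRoundTrip {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("case-class-roundtrip").getOrCreate()
    import spark.implicits._

    // Write the Dataset as Parquet: the schema travels with the files,
    // so no "Example(...)" text ever appears in HDFS.
    val ds = Seq(Example(1, "name", "value")).toDS()
    ds.write.mode("overwrite").parquet("/data/examples") // placeholder path

    // Read it back and re-attach the case class with .as[Example].
    val back = spark.read.parquet("/data/examples").as[Example]
    back.show()
  }
}
```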


Reading back hdfs files saved as case class

2016-10-07 Thread Deepak Sharma
Hi
I am saving RDD[Example] in HDFS from a Spark program, where Example is a case
class.
Now when I am trying to read it back, it returns RDD[String] with the
content as below:
*Example(1,name,value)*

The workaround can be to write it as a string in HDFS and read it back as a
string and perform further processing. This way the case class name wouldn't
appear at all in the file being written in HDFS.
But I am keen to know if we can read the data directly in Spark if the
RDD[Case_Class] is written to HDFS?

-- 
Thanks
Deepak


Use cases around image/video processing in spark

2016-08-10 Thread Deepak Sharma
Hi
If anyone is using, or knows about, a GitHub repo that can help me get started
with image and video processing using Spark, please share.
The images/videos will be stored in S3, and I am planning to use S3 with
Spark.
In this case, how will Spark achieve distributed processing?
Any code base or references are really appreciated.

-- 
Thanks
Deepak
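
Not a full image-processing repo, but a small sketch of how the distribution part works with files in S3: binaryFiles hands each executor a subset of the objects. The bucket name and the per-file processing are placeholders, and S3 access assumes the usual s3a connector configuration:

```scala
import org.apache.spark.sql.SparkSession

object ImageJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("image-processing").getOrCreate()
    val sc = spark.sparkContext

    // binaryFiles lists the objects under the prefix and partitions them
    // across executors, which is where the distributed processing comes from.
    val images = sc.binaryFiles("s3a://my-bucket/images/")

    val sizes = images.map { case (path, stream) =>
      val bytes = stream.toArray() // decode / feature-extract here instead
      (path, bytes.length)
    }

    sizes.take(10).foreach(println)
    spark.stop()
  }
}
```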


How to map values read from text file to 2 different RDDs

2016-05-23 Thread Deepak Sharma
Hi
I am reading a text file with 16 fields.
All the placeholders for the values of this text file have been defined in,
say, 2 different case classes:
Case1 and Case2

How do I map the values read from the text file so that my function in Scala can
return 2 different RDDs, with each RDD being of one of these 2 different
case class types?
E.g. the first 11 fields mapped to Case1 while the remaining fields mapped to Case2.
Any pointer here or code snippet would be really helpful.


-- 
Thanks
Deepak
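
One way to do it, sketched with hypothetical, smaller case classes; the real code would map its 16 fields the same way, splitting them between Case1 and Case2:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical case classes with fewer fields than the real ones, for brevity.
case class Case1(id: String, name: String, city: String)
case class Case2(score: String, flag: String)

object SplitRecords {
  // Parse each line once, build both case classes, then project each side out.
  def split(lines: RDD[String]): (RDD[Case1], RDD[Case2]) = {
    val pairs = lines
      .map(_.split(",", -1))
      .map(f => (Case1(f(0), f(1), f(2)), Case2(f(3), f(4))))
    pairs.cache() // avoid re-parsing the file for each of the two projections
    (pairs.map(_._1), pairs.map(_._2))
  }
}

// Usage:
// val spark = SparkSession.builder.getOrCreate()
// val (rdd1, rdd2) = SplitRecords.split(spark.sparkContext.textFile("/data/input.txt"))
```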