Re: Reading TB of JSON file

2020-06-19 Thread Chetan Khatri
Thanks. You mean in a for loop? Could you please share pseudocode in Spark?



Re: Reading TB of JSON file

2020-06-19 Thread Jörn Franke
Make every JSON object a line and then read it as JSON Lines, not as multiline.
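
A minimal sketch of this approach in Spark (Scala), assuming the objects are
already written one per line; the bucket path, app name, and output location
below are placeholders, not from this thread:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("read-json-lines").getOrCreate()

// With the default multiLine = false, every line is parsed as an independent
// JSON record, so the file is split across executors and never has to fit in
// driver memory.
val transactions = spark.read
  .option("multiLine", "false")
  .json("s3a://some-bucket/transactions.jsonl")

// Persist to HDFS for the next transformation (Parquet keeps the schema).
transactions.write.mode("overwrite").parquet("hdfs:///data/transactions")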



Re: Reading TB of JSON file

2020-06-19 Thread Chetan Khatri
All transactions are in JSON; it is not a single array.



Re: Reading TB of JSON file

2020-06-19 Thread Chetan Khatri
Yes

On Thu, Jun 18, 2020 at 12:34 PM Gourav Sengupta wrote:

> Hi,
> So you have a single JSON record spread over multiple lines?
> And all 50 GB is in one file?
>
> Regards,
> Gourav


Re: Reading TB of JSON file

2020-06-18 Thread Stephan Wehner
It's an interesting problem. What is the structure of the file? One big
array? One hash with many key-value pairs?

Stephan



-- 
Stephan Wehner, Ph.D.
The Buckmaster Institute, Inc.
2150 Adanac Street
Vancouver BC V5L 2E7
Canada
Cell (604) 767-7415
Fax (888) 808-4655

Sign up for our free email course
http://buckmaster.ca/small_business_website_mistakes.html

http://www.buckmaster.ca
http://answer4img.com
http://loggingit.com
http://clocklist.com
http://stephansmap.org
http://benchology.com
http://www.trafficlife.com
http://stephan.sugarmotor.org (Personal Blog)
@stephanwehner (Personal Account)
VA7WSK (Personal call sign)


Re: Reading TB of JSON file

2020-06-18 Thread Gourav Sengupta
Hi,
So you have a single JSON record spread over multiple lines?
And all 50 GB is in one file?

Regards,
Gourav



Re: Reading TB of JSON file

2020-06-18 Thread Chetan Khatri
It is dynamically generated and written to an S3 bucket, not historical data,
so I guess it isn't in JSON Lines format.



Re: Reading TB of JSON file

2020-06-18 Thread Chetan Khatri
The file is available in an S3 bucket.




Re: Reading TB of JSON file

2020-06-18 Thread nihed mbarek
Hi,

What is the size of one JSON document?

There is also the scan of your JSON to infer the schema; the overhead can be
huge. Two solutions: define a schema and pass it directly during the load, or
ask Spark to analyse only a small part of the JSON file (I don't remember how
to do it).
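
A rough sketch of both options in Spark (Scala); the field names are invented
for illustration, and samplingRatio is the JSON reader option that limits how
much of the input is scanned for schema inference:

// Assumes an existing SparkSession named `spark` (e.g. from spark-shell);
// paths and field names below are placeholders.
import org.apache.spark.sql.types._

// Option 1: supply the schema up front so no inference scan is needed.
val txSchema = StructType(Seq(
  StructField("transaction_id", StringType),
  StructField("amount", DoubleType),
  StructField("created_at", TimestampType)
))
val withSchema = spark.read.schema(txSchema).json("s3a://some-bucket/transactions/")

// Option 2: infer the schema from roughly 1% of the records instead of all of them.
val sampled = spark.read
  .option("samplingRatio", "0.01")
  .json("s3a://some-bucket/transactions/")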

Regards,




-- 

M'BAREK Med Nihed,
Fedora Ambassador, TUNISIA, Northern Africa
http://www.nihed.com




Re: Reading TB of JSON file

2020-06-18 Thread Jörn Franke
It depends on the data types you use.

Do you have it in JSON Lines format? Then the amount of memory plays much less
of a role.

Otherwise, if it is one large object or array, I would not recommend it.
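
To make the distinction concrete, a small Scala sketch; the paths are
placeholders and an existing SparkSession named `spark` is assumed:

// JSON Lines: one object per line, so the file is splittable and the 50 GB
// are divided into many partitions read by the executors in parallel.
val jsonLines = spark.read.json("hdfs:///staging/transactions.jsonl")

// One large object or array: multiLine = true is required, but then each file
// is parsed as a whole by a single task, which is why it is not recommended
// at this size.
val singleDoc = spark.read
  .option("multiLine", "true")
  .json("hdfs:///staging/transactions.json")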




Re: Reading TB of JSON file

2020-06-18 Thread Patrick McCarthy
Assuming that the file can be easily split, I would divide it into a number
of pieces and move those pieces to HDFS before using Spark at all, using
`hdfs dfs` or similar. At that point you can use your executors to perform
the reading instead of the driver.
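
Assuming the pieces have already been copied into one HDFS directory with
`hdfs dfs -put` or similar (directory names below are placeholders), the
Spark-side read might then look roughly like this:

// Assumes an existing SparkSession named `spark` (e.g. from spark-shell).
// Pointing the reader at the directory lets each executor read its own files,
// so the driver never has to hold the raw JSON.
val parts = spark.read
  .json("hdfs:///staging/transaction_parts/")   // add .option("multiLine", "true") if each piece is one multi-line JSON document

// Persist for the next transformation.
parts.write.mode("overwrite").parquet("hdfs:///data/transactions")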



-- 


Patrick McCarthy

Senior Data Scientist, Machine Learning Engineering

Dstillery

470 Park Ave South, 17th Floor, NYC 10016


Reading TB of JSON file

2020-06-18 Thread Chetan Khatri
Hi Spark Users,

I have a 50 GB JSON file that I would like to read and persist to HDFS so it
can be used in the next transformation. I am trying to read it with
spark.read.json(path), but this gives an out-of-memory error on the driver.
Obviously, I can't afford 50 GB of driver memory. In general, what is the best
practice for reading a large JSON file of around 50 GB?

Thanks