Single machine? Any other framework will perform better than Spark.

On Tue, 19 Jun 2018 at 09:40, Aakash Basu <aakash.spark....@gmail.com> wrote:
> Georg, just asking: can pandas handle such a big dataset if that data is
> then passed into any of the sklearn modules?
>
> On Tue, Jun 19, 2018 at 10:35 AM, Georg Heiler <georg.kf.hei...@gmail.com> wrote:
>
>> Use pandas or dask.
>>
>> If you do want to use Spark, store the dataset as Parquet/ORC, and then
>> continue to perform analytical queries on that dataset.
>>
>> Raymond Xie <xie3208...@gmail.com> wrote on Tue, 19 Jun 2018 at 04:29:
>>
>>> I have a 3.6 GB CSV dataset (4 columns, 100,150,807 rows); my environment
>>> is a 20 GB SSD hard disk and 2 GB of RAM.
>>>
>>> The dataset contains:
>>> User ID: 987,994
>>> Item ID: 4,162,024
>>> Category ID: 9,439
>>> Behavior type ('pv', 'buy', 'cart', 'fav')
>>> Unix timestamp: spanning November 25 to December 3, 2017
>>>
>>> I would like to hear any suggestions from you on how I should process
>>> the dataset with my current environment.
>>>
>>> Thank you.
>>>
>>> ------------------------------------------------
>>> Sincerely yours,
>>>
>>> Raymond