Re: How can I do the following simple scenario in spark

2018-06-19 Thread Sonal Goyal
Try flatMapToPair instead of flatMap
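Since your snippet is PySpark, where pair RDDs are just RDDs of tuples, the
equivalent is a flatMap that returns a list of (word, id) tuples. A rough,
untested sketch, reusing your word_tokenize call:

pairs = data.rdd.flatMap(
    lambda row: [(word, row[1]) for word in word_tokenize(row[0].lower())])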

Thanks,
Sonal
Nube Technologies 





On Tue, Jun 19, 2018 at 11:08 PM, Soheil Pourbafrani 
wrote:

> Hi, I have a JSON file in the following structure:
> +------------+---+
> |   full_text| id|
> +------------+---+
>
> I want to tokenize each sentence into pairs of (word, id)
>
> for example, having the record : ("Hi, How are you?", id) I want to
> convert the dataframe to:
> hi, id
> how, id
> are, id
> you, id
> ?, id
>
> So I try :
>
> data.rdd.map(lambda data : (data[0], data[1]))\
>     .flatMap(lambda row: (word_tokenize(row[0].lower()), row[1]))
>
> but it converts the dataframe to:
> [hi, how, are, you, ?]
>
> How can I do the desired transformation?
>


[Spark SQL]: How to read Hive tables with Sub directories - is this supported?

2018-06-19 Thread mattl156
Hello,

 

We have a number of Hive tables (non partitioned) that are populated with
subdirectories. (result of tez execution engine union queries)

 

E.g. Table location: “s3://table1/” With the actual data residing in:

 

s3://table1/1/data1

s3://table1/2/data2

s3://table1/3/data3

 

When using SparkSession (sql/hiveContext has the same behavior) and
spark.sql to query the data, no records are displayed due to these
subdirectories.



e.g 

val df = spark.sql("select * from db.table1").show()



I’ve tried a number of setConf properties e.g.
spark.hive.mapred.supports.subdirectories=true,
mapreduce.input.fileinputformat.input.dir.recursive=true but it does not
look like any of these properties are supported. 
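
For reference, the attempts looked roughly like this (none of them changed the
result):

spark.conf.set("spark.hive.mapred.supports.subdirectories", "true")
spark.conf.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
val df = spark.sql("select * from db.table1")
df.show()   // still returns no rows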

 

Has anyone run into a similar problem or found a way to resolve it? Our current
alternative is to read the input path directly, e.g.:




spark.read.csv("s3://bucket-name/table1/bullseye_segments/*/*")


But this requires prior knowledge of the path or an extra step to determine
it. 
 

Thanks,

Matt








Re: Spark DF to Hive table with both Partition and Bucketing not working

2018-06-19 Thread Subhash Sriram
Hi Umar,

Could it be that spark.sql.sources.bucketing.enabled is not set to true? 
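
Also, as far as I know bucketBy() is only supported through saveAsTable(), not
through a path-based save such as .orc(path). Something along these lines might
get past the "'save' does not support bucketing" error (untested sketch):

(df_filtered.write
    .format("orc")
    .partitionBy("DAYPART")
    .bucketBy(24, "HRS")
    .sortBy("HRS")
    .mode("append")
    .option("path", "/user/umar/netflow_filtered")
    .saveAsTable("default.DDOS_NETFLOW_FILTERED"))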

Thanks,
Subhash

Sent from my iPhone

> On Jun 19, 2018, at 11:41 PM, umargeek  wrote:
> 
> Hi Folks,
> 
> I am trying to save a spark data frame after reading from ORC file and add
> two new columns and finally trying to save it to hive table with both
> partition and bucketing feature.
> 
> Using Spark 2.3 (as both partition and bucketing feature are available in
> this version).
> 
> Looking for advise.
> 
> Code Snippet:
> 
> df_orc_data =
> spark.read.format("orc").option("delimiter","|").option("header",
> "true").option("inferschema", "true").load(filtered_path)
> df_fil_ts_data = df_orc_data.withColumn("START_TS",
> lit(process_time).cast("timestamp"))
> daily = (datetime.datetime.utcnow().strftime('%Y-%m-%d'))
> df_filtered_data =
> df_fil_ts_data.withColumn("DAYPART",lit(daily).cast("string"))
> hour = (datetime.datetime.utcnow().strftime('%H'))
> df_filtered = df_filtered_data.withColumn("HRS",lit(hour).cast("string"))
> (df_filtered.write.partitionBy("DAYPART").bucketBy(24,"HRS").sortBy("HRS").mode("append").orc('/user/umar/netflow_filtered').saveAsTable("default.DDOS_NETFLOW_FILTERED"))
> 
> Error:
> "'save' does not support bucketing right now;"
> 
> 
> 
> Thanks,
> Umar
> 
> 
> 



Re: How to validate orc vectorization is working within spark application?

2018-06-19 Thread Jörn Franke
Full code? What is expected performance and actual ?
What is the use case?
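
In the meantime, one rough way to check whether the vectorized ORC reader is
actually being used (a sketch, assuming Spark 2.3 with the native ORC reader;
the path is a placeholder):

spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
df = spark.read.orc("/path/to/orc/data")
df.explain()
# a vectorized scan shows up as "FileScan orc ... Batched: true" in the plan;
# "Batched: false" means the row-based reader is being used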

> On 20. Jun 2018, at 05:33, umargeek  wrote:
> 
> Hi Folks,
> 
> I would just require a few pointers on the above query w.r.t. vectorization;
> looking forward to support from the community.
> 
> Thanks,
> Umar
> 
> 
> 



Spark DF to Hive table with both Partition and Bucketing not working

2018-06-19 Thread umargeek
Hi Folks,

I am trying to read a Spark data frame from an ORC file, add two new
columns, and finally save it to a Hive table with both the partitioning and
bucketing features.

Using Spark 2.3 (as both the partitioning and bucketing features are
available in this version).

Looking for advice.

Code Snippet:

df_orc_data = (spark.read.format("orc")
    .option("delimiter", "|")
    .option("header", "true")
    .option("inferschema", "true")
    .load(filtered_path))
df_fil_ts_data = df_orc_data.withColumn("START_TS",
    lit(process_time).cast("timestamp"))
daily = (datetime.datetime.utcnow().strftime('%Y-%m-%d'))
df_filtered_data = df_fil_ts_data.withColumn("DAYPART", lit(daily).cast("string"))
hour = (datetime.datetime.utcnow().strftime('%H'))
df_filtered = df_filtered_data.withColumn("HRS", lit(hour).cast("string"))
(df_filtered.write.partitionBy("DAYPART").bucketBy(24, "HRS").sortBy("HRS")
    .mode("append").orc('/user/umar/netflow_filtered')
    .saveAsTable("default.DDOS_NETFLOW_FILTERED"))

Error:
"'save' does not support bucketing right now;"



Thanks,
Umar






Anomaly when dealing with Unix timestamp

2018-06-19 Thread Raymond Xie
Hello,

I have a dataframe; applying from_unixtime seems to expose an anomaly:

scala> val bhDF4 = bhDF.withColumn("ts1", $"ts" + 28800).withColumn("ts2",
from_unixtime($"ts" + 28800,"MMddhhmmss"))
bhDF4: org.apache.spark.sql.DataFrame = [user_id: int, item_id: int ... 5
more fields]

scala> bhDF4.show
+-------+-------+-------+--------+----------+----------+--------------+
|user_id|item_id| cat_id|behavior|        ts|       ts1|           ts2|
+-------+-------+-------+--------+----------+----------+--------------+
|      1|2268318|2520377|      pv|1511544070|1511572870|20171124082110|
|      1|    246|2520771|      pv|1511561733|1511590533|20171125011533|
|      1|2576651| 149192|      pv|1511572885|1511601685|20171125042125|
|      1|3830808|4181361|      pv|1511593493|1511622293|20171125100453|
|      1|4365585|2520377|      pv|1511596146|1511624946|20171125104906|
|      1|4606018|2735466|      pv|1511616481|1511645281|20171125042801|
|      1| 230380| 411153|      pv|1511644942|1511673742|    2017112612|
|      1|3827899|2920476|      pv|1511713473|1511742273|20171126072433|
|      1|3745169|2891509|      pv|1511725471|1511754271|20171126104431|
|      1|1531036|2920476|      pv|1511733732|1511762532|20171127010212|
|      1|2266567|4145813|      pv|1511741471|1511770271|    2017112703|
|      1|2951368|1080785|      pv|1511750828|1511779628|20171127054708|
|      1|3108797|2355072|      pv|1511758881|1511787681|20171127080121|
|      1|1338525| 149192|      pv|1511773214|1511802014|20171127120014|
|      1|2286574|2465336|      pv|1511797167|1511825967|20171127063927|
|      1|5002615|2520377|      pv|1511839385|1511868185|20171128062305|
|      1|2734026|4145813|      pv|1511842184|1511870984|20171128070944|
|      1|5002615|2520377|      pv|1511844273|1511873073|20171128074433|
|      1|3239041|2355072|      pv|1511855664|1511884464|20171128105424|
|      1|4615417|4145813|      pv|1511870864|1511899664|20171128030744|
+-------+-------+-------+--------+----------+----------+--------------+
only showing top 20 rows


All ts2 are supposed to show date after 20171125 while there seems to be
at least one anomaly showing 20171124
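
Working through the first row by hand: ts1 = 1511572870 is 2017-11-25 01:21:10
UTC, and from_unixtime formats in the session time zone, so in a UTC-5 zone
that instant becomes 2017-11-24 20:21:10, which the 12-hour "hh" pattern prints
as 20171124082110, exactly the value shown. So my guess is that the session
time zone together with the 12-hour "hh" pattern, rather than the data itself,
explains the 20171124 value; switching to "HH" and checking
spark.sql.session.timeZone might confirm it.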

Any thoughts?



Sincerely yours,


Raymond


Re: How to validate orc vectorization is working within spark application?

2018-06-19 Thread umargeek
Hi Folks,

I would just require a few pointers on the above query w.r.t. vectorization;
looking forward to support from the community.

Thanks,
Umar






Re: Best way to process this dataset

2018-06-19 Thread Raymond Xie
Thank you, that works.


Sincerely yours,


Raymond

On Tue, Jun 19, 2018 at 4:36 PM, Nicolas Paris  wrote:

> Hi Raymond
>
> Spark works well on single machine too, since it benefits from multiple
> core.
> The csv parser is based on univocity and you might use the
> "spark.read.csv" syntax instead of using the rdd api;
>
> From my experience, this will be better than any other csv parser
>
> 2018-06-19 16:43 GMT+02:00 Raymond Xie :
>
>> Thank you Matteo, Askash and Georg:
>>
>> I am attempting to get some stats first, the data is like:
>>
>> 1,4152983,2355072,pv,1511871096
>>
>> I like to find out the count of Key of (UserID, Behavior Type)
>>
>> val bh_count = 
>> sc.textFile("C:\\RXIE\\Learning\\Data\\Alibaba\\UserBehavior\\UserBehavior.csv").map(_.split(",")).map(x
>>  => ((x(0).toInt,x(3)),1)).groupByKey()
>>
>> This shows me:
>> scala> val first = bh_count.first
>> [Stage 1:>  (0 +
>> 1) / 1]2018-06-19 10:41:19 WARN  Executor:66 - Managed memory leak
>> detected; size = 15848112 bytes, TID = 110
>> first: ((Int, String), Iterable[Int]) = ((878310,pv),CompactBuffer(1, 1,
>> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
>> 1, 1))
>>
>>
>> *Note this environment is: Windows 7 with 32GB RAM. (I am firstly running
>> it in Windows where I have more RAM instead of Ubuntu so the env differs to
>> what I said in the original email)*
>> *Dataset is 3.6GB*
>>
>> *Thank you very much.*
>> **
>> *Sincerely yours,*
>>
>>
>> *Raymond*
>>
>> On Tue, Jun 19, 2018 at 4:04 AM, Matteo Cossu  wrote:
>>
>>> Single machine? Any other framework will perform better than Spark
>>>
>>> On Tue, 19 Jun 2018 at 09:40, Aakash Basu 
>>> wrote:
>>>
 Georg, just asking, can Pandas handle such a big dataset? If that data
 is further passed into using any of the sklearn modules?

 On Tue, Jun 19, 2018 at 10:35 AM, Georg Heiler <
 georg.kf.hei...@gmail.com> wrote:

> use pandas or dask
>
> If you do want to use spark store the dataset as parquet / orc. And
> then continue to perform analytical queries on that dataset.
>
> Raymond Xie  wrote on Tue., Jun 19, 2018 at 04:29:
>
>> I have a 3.6GB csv dataset (4 columns, 100,150,807 rows), my
>> environment is 20GB ssd harddisk and 2GB RAM.
>>
>> The dataset comes with
>> User ID: 987,994
>> Item ID: 4,162,024
>> Category ID: 9,439
>> Behavior type ('pv', 'buy', 'cart', 'fav')
>> Unix Timestamp: span between November 25 to December 03, 2017
>>
>> I would like to hear any suggestion from you on how should I process
>> the dataset with my current environment.
>>
>> Thank you.
>>
>> **
>> *Sincerely yours,*
>>
>>
>> *Raymond*
>>
>

>>
>


Re: Best way to process this dataset

2018-06-19 Thread Nicolas Paris
Hi Raymond

Spark works well on single machine too, since it benefits from multiple
core.
The csv parser is based on univocity and you might use the
"spark.read.csv" syntax instead of using the rdd api;

From my experience, this will be better than any other csv parser.
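
For the (UserID, Behavior Type) count below, that would look roughly like this
(untested sketch; the column names are made up since the file has no header):

val df = spark.read
  .csv("C:\\RXIE\\Learning\\Data\\Alibaba\\UserBehavior\\UserBehavior.csv")
  .toDF("user_id", "item_id", "cat_id", "behavior", "ts")
val bh_count = df.groupBy("user_id", "behavior").count()
bh_count.show()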

2018-06-19 16:43 GMT+02:00 Raymond Xie :

> Thank you Matteo, Askash and Georg:
>
> I am attempting to get some stats first, the data is like:
>
> 1,4152983,2355072,pv,1511871096
>
> I like to find out the count of Key of (UserID, Behavior Type)
>
> val bh_count = 
> sc.textFile("C:\\RXIE\\Learning\\Data\\Alibaba\\UserBehavior\\UserBehavior.csv").map(_.split(",")).map(x
>  => ((x(0).toInt,x(3)),1)).groupByKey()
>
> This shows me:
> scala> val first = bh_count.first
> [Stage 1:>  (0 +
> 1) / 1]2018-06-19 10:41:19 WARN  Executor:66 - Managed memory leak
> detected; size = 15848112 bytes, TID = 110
> first: ((Int, String), Iterable[Int]) = ((878310,pv),CompactBuffer(1, 1,
> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> 1, 1))
>
>
> *Note this environment is: Windows 7 with 32GB RAM. (I am firstly running
> it in Windows where I have more RAM instead of Ubuntu so the env differs to
> what I said in the original email)*
> *Dataset is 3.6GB*
>
> *Thank you very much.*
> **
> *Sincerely yours,*
>
>
> *Raymond*
>
> On Tue, Jun 19, 2018 at 4:04 AM, Matteo Cossu  wrote:
>
>> Single machine? Any other framework will perform better than Spark
>>
>> On Tue, 19 Jun 2018 at 09:40, Aakash Basu 
>> wrote:
>>
>>> Georg, just asking, can Pandas handle such a big dataset? If that data
>>> is further passed into using any of the sklearn modules?
>>>
>>> On Tue, Jun 19, 2018 at 10:35 AM, Georg Heiler <
>>> georg.kf.hei...@gmail.com> wrote:
>>>
 use pandas or dask

 If you do want to use spark store the dataset as parquet / orc. And
 then continue to perform analytical queries on that dataset.

 Raymond Xie  wrote on Tue., Jun 19, 2018 at 04:29:

> I have a 3.6GB csv dataset (4 columns, 100,150,807 rows), my
> environment is 20GB ssd harddisk and 2GB RAM.
>
> The dataset comes with
> User ID: 987,994
> Item ID: 4,162,024
> Category ID: 9,439
> Behavior type ('pv', 'buy', 'cart', 'fav')
> Unix Timestamp: span between November 25 to December 03, 2017
>
> I would like to hear any suggestion from you on how should I process
> the dataset with my current environment.
>
> Thank you.
>
> **
> *Sincerely yours,*
>
>
> *Raymond*
>

>>>
>


How can I do the following simple scenario in spark

2018-06-19 Thread Soheil Pourbafrani
Hi, I have a JSON file in the following structure:
+------------+---+
|   full_text| id|
+------------+---+

I want to tokenize each sentence into pairs of (word, id)

for example, having the record : ("Hi, How are you?", id) I want to convert
the dataframe to:
hi, id
how, id
are, id
you, id
?, id

So I try :

data.rdd.map(lambda data : (data[0], data[1]))\
    .flatMap(lambda row: (word_tokenize(row[0].lower()), row[1]))

but it converts the dataframe to:
[hi, how, are, you, ?]

How can I do the desired transformation?


Re: How to set spark.driver.memory?

2018-06-19 Thread Prem Sure
Hi, can you share the exception?
You need to give the value as well, right after --driver-memory. First
preference goes to the config key/value pairs defined in spark-submit, and only
then to spark-defaults.conf.
You can refer to the docs for the exact variable name.
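
For example (the class and jar names below are placeholders):

spark-submit --driver-memory 4g --class com.example.MyApp my-app.jar

If the application is started with plain "java -jar" rather than spark-submit,
the driver is that JVM itself, so its heap is simply the JVM heap, e.g.:

java -Xmx4g -jar my-app.jar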

Thanks,
Prem

On Tue, Jun 19, 2018 at 5:47 PM onmstester onmstester 
wrote:

> I have a spark cluster containing 3 nodes and my application is a jar file
> running by java -jar .
> How can i set driver.memory for my application?
> spark-defaults.conf would only be read by ./spark-submit
> "java --driver-memory -jar " fails with exception.
>
> Sent using Zoho Mail 
>
>
>


spark kafka consumer with kerberos - login error

2018-06-19 Thread Amol Zambare
I am working on a Spark job which reads from a Kafka topic and writes to HDFS;
however, while submitting the job using the spark-submit command I am getting
the following error.


Error log

 Caused by: org.apache.kafka.common.KafkaException: 
javax.security.auth.login.LoginException: Could not login: the client is being 
asked for a password, but the Kafka client code does not currently support 
obtaining a password from the user. Make sure -Djava.security.auth.login.config 
property passed to JVM and the client is configured to use a ticket cache 
(using the JAAS configuration setting 'useTicketCache=true)'. Make sure you are 
using FQDN of the Kafka broker you are trying to connect to. not available to 
garner  authentication information from the user


I am passing the user keytab and kafka_client_jaas.conf file to the spark-submit
command as suggested in the Hortonworks documentation or
https://github.com/hortonworks-spark/skc#running-on-a-kerberos-enabled-cluster
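
For reference, the keytab-based JAAS template from that documentation looks
roughly like this (principal, service name and keytab path below are
placeholders, not my actual values):

KafkaClient {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  storeKey=true
  keyTab="./user.keytab"
  principal="user@EXAMPLE.COM"
  serviceName="kafka";
};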


I am passing the following parameters to spark-submit:

--files user.keytab,kafka_client_jaas.conf \
--driver-java-options "-Djava.security.auth.login.config=kafka_client_jaas.conf" \
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=kafka_client_jaas.conf" \

Version information

Spark version - 2.2.0.2.6.4.0-91
Kafka version - 0.10.1

Any help is much appreciated.

Thanks,
Amol



Re: Best way to process this dataset

2018-06-19 Thread Raymond Xie
Thank you Matteo, Aakash and Georg:

I am attempting to get some stats first, the data is like:

1,4152983,2355072,pv,1511871096

I would like to find out the count per key of (UserID, Behavior Type)

val bh_count = 
sc.textFile("C:\\RXIE\\Learning\\Data\\Alibaba\\UserBehavior\\UserBehavior.csv").map(_.split(",")).map(x
=> ((x(0).toInt,x(3)),1)).groupByKey()

This shows me:
scala> val first = bh_count.first
[Stage 1:>  (0 + 1)
/ 1]2018-06-19 10:41:19 WARN  Executor:66 - Managed memory leak detected;
size = 15848112 bytes, TID = 110
first: ((Int, String), Iterable[Int]) = ((878310,pv),CompactBuffer(1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1))


Note this environment is: Windows 7 with 32GB RAM. (I am first running it
on Windows, where I have more RAM, instead of Ubuntu, so the environment
differs from what I said in the original email.)
Dataset is 3.6GB.

Thank you very much.

Sincerely yours,


Raymond

On Tue, Jun 19, 2018 at 4:04 AM, Matteo Cossu  wrote:

> Single machine? Any other framework will perform better than Spark
>
> On Tue, 19 Jun 2018 at 09:40, Aakash Basu 
> wrote:
>
>> Georg, just asking, can Pandas handle such a big dataset? If that data is
>> further passed into using any of the sklearn modules?
>>
>> On Tue, Jun 19, 2018 at 10:35 AM, Georg Heiler > > wrote:
>>
>>> use pandas or dask
>>>
>>> If you do want to use spark store the dataset as parquet / orc. And then
>>> continue to perform analytical queries on that dataset.
>>>
>>> Raymond Xie  wrote on Tue., Jun 19, 2018 at 04:29:
>>>
 I have a 3.6GB csv dataset (4 columns, 100,150,807 rows), my
 environment is 20GB ssd harddisk and 2GB RAM.

 The dataset comes with
 User ID: 987,994
 Item ID: 4,162,024
 Category ID: 9,439
 Behavior type ('pv', 'buy', 'cart', 'fav')
 Unix Timestamp: span between November 25 to December 03, 2017

 I would like to hear any suggestion from you on how should I process
 the dataset with my current environment.

 Thank you.

 **
 *Sincerely yours,*


 *Raymond*

>>>
>>


Re: spark-shell doesn't start

2018-06-19 Thread Irving Duran
You are trying to run "spark-shell" as a command which is not in your
environment.  You might want to do "./spark-shell" or try "sudo ln -s
/path/to/spark-shell /usr/bin/spark-shell" and then do "spark-shell".

Thank You,

Irving Duran


On Sun, Jun 17, 2018 at 6:53 AM Raymond Xie  wrote:

> Hello, I am doing the practice in Ubuntu now, here is the error I am
> encountering:
>
>
> rxie@ubuntu:~/Downloads/spark/bin$ spark-shell
> Error: Could not find or load main class org.apache.spark.launcher.Main
>
>
> What am I missing?
>
> Thank you very much.
>
> Java is installed.
>
> **
> *Sincerely yours,*
>
>
> *Raymond*
>


Re: [Spark] Supporting python 3.5?

2018-06-19 Thread Irving Duran
Cool, thanks for the validation!

Thank You,

Irving Duran


On Thu, May 24, 2018 at 8:20 PM Jeff Zhang  wrote:

>
> It supports Python 3.5, and IIRC, Spark also supports Python 3.6.
>
> Irving Duran wrote on Thu, May 10, 2018 at 9:08 PM:
>
>> Does spark now support python 3.5 or it is just 3.4.x?
>>
>> https://spark.apache.org/docs/latest/rdd-programming-guide.html
>>
>> Thank You,
>>
>> Irving Duran
>>
>


How to set spark.driver.memory?

2018-06-19 Thread onmstester onmstester
I have a Spark cluster containing 3 nodes and my application is a jar file
run with java -jar.

How can I set driver.memory for my application?

spark-defaults.conf would only be read by ./spark-submit.

"java --driver-memory -jar " fails with an exception.


Sent using Zoho Mail







enable jmx in standalone mode

2018-06-19 Thread onmstester onmstester
How to enable jmx for spark worker/executor/driver in standalone mode?

I have added these to spark/conf/spark-defaults.conf:

spark.driver.extraJavaOptions=-Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=9178 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false
spark.executor.extraJavaOptions=-Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=0 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false

run stop-slave.sh

and then start-slave.sh

Using netstat -anop | grep executor-pid, there is no port other than the Spark
API port associated with the process.
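
Do I also need something like the following in conf/spark-env.sh for the worker
daemon itself? (Just a guess; as I understand it, the extraJavaOptions above
only apply to application driver/executor JVMs, not to the worker process.)

export SPARK_WORKER_OPTS="-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9178 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false"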
Sent using Zoho Mail







Re: Best way to process this dataset

2018-06-19 Thread Matteo Cossu
Single machine? Any other framework will perform better than Spark

On Tue, 19 Jun 2018 at 09:40, Aakash Basu 
wrote:

> Georg, just asking, can Pandas handle such a big dataset? If that data is
> further passed into using any of the sklearn modules?
>
> On Tue, Jun 19, 2018 at 10:35 AM, Georg Heiler 
> wrote:
>
>> use pandas or dask
>>
>> If you do want to use spark store the dataset as parquet / orc. And then
>> continue to perform analytical queries on that dataset.
>>
>> Raymond Xie  wrote on Tue., Jun 19, 2018 at 04:29:
>>
>>> I have a 3.6GB csv dataset (4 columns, 100,150,807 rows), my environment
>>> is 20GB ssd harddisk and 2GB RAM.
>>>
>>> The dataset comes with
>>> User ID: 987,994
>>> Item ID: 4,162,024
>>> Category ID: 9,439
>>> Behavior type ('pv', 'buy', 'cart', 'fav')
>>> Unix Timestamp: span between November 25 to December 03, 2017
>>>
>>> I would like to hear any suggestion from you on how should I process the
>>> dataset with my current environment.
>>>
>>> Thank you.
>>>
>>> **
>>> *Sincerely yours,*
>>>
>>>
>>> *Raymond*
>>>
>>
>


Re: Best way to process this dataset

2018-06-19 Thread Aakash Basu
Georg, just asking, can Pandas handle such a big dataset? If that data is
further passed into using any of the sklearn modules?

On Tue, Jun 19, 2018 at 10:35 AM, Georg Heiler 
wrote:

> use pandas or dask
>
> If you do want to use spark store the dataset as parquet / orc. And then
> continue to perform analytical queries on that dataset.
>
> Raymond Xie  wrote on Tue., Jun 19, 2018 at 04:29:
>
>> I have a 3.6GB csv dataset (4 columns, 100,150,807 rows), my environment
>> is 20GB ssd harddisk and 2GB RAM.
>>
>> The dataset comes with
>> User ID: 987,994
>> Item ID: 4,162,024
>> Category ID: 9,439
>> Behavior type ('pv', 'buy', 'cart', 'fav')
>> Unix Timestamp: span between November 25 to December 03, 2017
>>
>> I would like to hear any suggestion from you on how should I process the
>> dataset with my current environment.
>>
>> Thank you.
>>
>> **
>> *Sincerely yours,*
>>
>>
>> *Raymond*
>>
>