Re: Read hdfs files in spark streaming

2019-06-09 Thread Deepak Sharma
The context is different here.
The file paths arrive as messages on a Kafka topic.
Spark Structured Streaming consumes from this topic.
The application then has to extract the value from each message, which is the
path to a file, and read the JSON stored at that location into another DataFrame.
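
For reference, a rough sketch of one way to do this with foreachBatch
(available from Spark 2.4); the broker address, topic name, and output path
below are placeholders:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-path-reader").getOrCreate()

# Stream of Kafka messages whose value is an HDFS path (broker/topic are placeholders).
paths = (spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "file-paths")
         .load()
         .select(col("value").cast("string").alias("path")))

def read_json_for_batch(batch_df, batch_id):
    # Collect the (small) set of paths in this micro-batch on the driver,
    # then read the referenced JSON files with the regular batch reader.
    file_paths = [r.path for r in batch_df.collect()]
    if file_paths:
        json_df = spark.read.json(file_paths)
        json_df.write.mode("append").parquet("/tmp/json-output")  # placeholder sink

(paths.writeStream
      .foreachBatch(read_json_for_batch)
      .start()
      .awaitTermination())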

Thanks
Deepak

On Sun, Jun 9, 2019 at 11:03 PM vaquar khan  wrote:

> Hi Deepak,
>
> You can use textFileStream.
>
> https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html
>
> Please also consider asking questions on Stack Overflow, so that other
> people can benefit from the answers.
>
>
> Regards,
> Vaquar khan
>
> On Sun, Jun 9, 2019, 8:08 AM Deepak Sharma  wrote:
>
>> I am using a Spark Streaming application to read from Kafka.
>> The value of each Kafka message is the path to an HDFS file.
>> I am using Spark 2.x with spark.readStream.
>> What is the best way to read this path in Spark Streaming and then read
>> the JSON stored at the HDFS path, perhaps using spark.read.json, into a
>> DataFrame inside the streaming application?
>> Thanks a lot in advance
>>
>> --
>> Thanks
>> Deepak
>>
>

-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: High level explanation of dropDuplicates

2019-06-09 Thread Rishi Shah
Hi All,

Just wanted to check back regarding the best way to perform deduplication. Is
dropDuplicates the optimal way to get rid of duplicates? Would it be better to
run the operations on the RDD directly?

Also, what if we want to keep the last value of the group while performing
deduplication (based on some sorting criteria)?
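
For example, something along these lines (a rough sketch, assuming a
timestamp column ts exists to order by):

from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# Keep the most recent row per (col1, col2), ordered by a hypothetical ts column.
w = Window.partitionBy("col1", "col2").orderBy(col("ts").desc())

deduped = (input_df
           .withColumn("rn", row_number().over(w))
           .filter(col("rn") == 1)
           .drop("rn"))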

Thanks,
Rishi

On Mon, May 20, 2019 at 3:33 PM Nicholas Hakobian <
nicholas.hakob...@rallyhealth.com> wrote:

> From doing some searching around in the spark codebase, I found the
> following:
>
>
> https://github.com/apache/spark/blob/163a6e298213f216f74f4764e241ee6298ea30b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1452-L1474
>
> So it appears there is no direct operation called dropDuplicates or
> Deduplicate, but there is an optimizer rule that converts this logical
> operation to a physical operation that is equivalent to grouping by all the
> columns you want to deduplicate across (or all columns if you are doing
> something like distinct), and taking the First() value. So (using a pySpark
> code example):
>
> df = input_df.dropDuplicates(['col1', 'col2'])
>
> Is effectively shorthand for saying something like:
>
> df = (input_df
>       .groupBy('col1', 'col2')
>       .agg(first(struct(input_df.columns)).alias('data'))
>       .select('data.*'))
>
> Except I assume that it has some internal optimization so it doesn't need
> to pack/unpack the column data, and just returns the whole Row.
>
> Nicholas Szandor Hakobian, Ph.D.
> Principal Data Scientist
> Rally Health
> nicholas.hakob...@rallyhealth.com
>
>
>
> On Mon, May 20, 2019 at 11:38 AM Yeikel  wrote:
>
>> Hi ,
>>
>> I am looking for a high-level explanation (overview) of how
>> dropDuplicates [1] works.
>>
>> [1]
>>
>> https://github.com/apache/spark/blob/db24b04cad421ed508413d397c6beec01f723aee/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2326
>>
>> Could someone please explain?
>>
>> Thank you
>>
>>
>>
>> --
>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>
>>
>>

-- 
Regards,

Rishi Shah


[pyspark 2.3+] Querying non-partitioned 3TB data table is too slow

2019-06-09 Thread Rishi Shah
Hi All,

I have a table with 3TB of data, stored as Parquet with Snappy compression and
about 100 columns. I am filtering the DataFrame on a date column (date between
20190501 and 20190530), selecting only 20 columns, and counting. This operation
takes about 45 minutes!

Shouldn't parquet do the predicate pushdown and filtering without scanning
the entire dataset?
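
For what it's worth, this is roughly how I check whether the filter and column
pruning actually reach the Parquet scan (the path, column names, and date
format below are placeholders):

from pyspark.sql.functions import col

df = spark.read.parquet("/warehouse/my_table")   # placeholder path

wanted_cols = ["date", "col_a", "col_b"]         # placeholder for the ~20 columns
filtered = (df
            .filter(col("date").between("20190501", "20190530"))
            .select(*wanted_cols))

# The physical plan shows whether the filter is pushed down to the scan
# (PushedFilters) and which columns are actually read (ReadSchema).
filtered.explain(True)
filtered.count()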

-- 
Regards,

Rishi Shah


Re: Read hdfs files in spark streaming

2019-06-09 Thread vaquar khan
Hi Deepak,

You can use textFileStream.

https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html
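
For reference, a minimal DStream sketch (the batch interval and directory
below are placeholders); textFileStream watches a directory and picks up files
newly created in it:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="hdfs-dir-monitor")
ssc = StreamingContext(sc, batchDuration=10)

# Read any file newly created under the monitored directory as lines of text.
lines = ssc.textFileStream("hdfs:///incoming/json")
lines.pprint()

ssc.start()
ssc.awaitTermination()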

Please also consider asking questions on Stack Overflow, so that other people
can benefit from the answers.


Regards,
Vaquar khan

On Sun, Jun 9, 2019, 8:08 AM Deepak Sharma  wrote:

> I am using a Spark Streaming application to read from Kafka.
> The value of each Kafka message is the path to an HDFS file.
> I am using Spark 2.x with spark.readStream.
> What is the best way to read this path in Spark Streaming and then read
> the JSON stored at the HDFS path, perhaps using spark.read.json, into a
> DataFrame inside the streaming application?
> Thanks a lot in advance
>
> --
> Thanks
> Deepak
>


Re: [Pyspark 2.4] Best way to define activity within different time window

2019-06-09 Thread Jörn Franke
Depending on what accuracy is needed, HyperLogLog can be an interesting
alternative:
https://en.m.wikipedia.org/wiki/HyperLogLog
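
For example, Spark's approx_count_distinct aggregate is based on HyperLogLog++;
a rough sketch, assuming an events DataFrame with user_id and event_date
columns:

from pyspark.sql import functions as F

# Approximate daily active users; rsd sets the maximum relative standard deviation.
daily_active = (events_df
                .groupBy("event_date")
                .agg(F.approx_count_distinct("user_id", rsd=0.01).alias("dau")))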

> On 09.06.2019 at 15:59, big data wrote:
> 
> In my opinion, a bitmap is the best solution for calculating active users.
> Most other solutions are based on a count(distinct) calculation, which is
> slower.
> 
> If you have already implemented a bitmap solution, including how to build and
> load the bitmaps, then a bitmap is the best choice.
> 
>> On 2019/6/5 at 6:49 PM, Rishi Shah wrote:
>> Hi All,
>> 
>> Is there a best practice around calculating daily, weekly, monthly,
>> quarterly, and yearly active users?
>>
>> One approach is to create daily bitmaps and aggregate them by period later.
>> However, I was wondering if anyone has a better approach to tackling this
>> problem.
>> 
>> -- 
>> Regards,
>> 
>> Rishi Shah


Re: [Pyspark 2.4] Best way to define activity within different time window

2019-06-09 Thread big data
In my opinion, a bitmap is the best solution for calculating active users. Most
other solutions are based on a count(distinct) calculation, which is slower.

If you have already implemented a bitmap solution, including how to build and
load the bitmaps, then a bitmap is the best choice.
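
For comparison, a rough sketch of the count(distinct)-style baseline that a
bitmap solution would replace, assuming a hypothetical events_df with user_id
and event_date columns:

from pyspark.sql import functions as F

# One activity flag per user per day.
daily = events_df.select("user_id", "event_date").distinct()

# Roll up to monthly active users; weekly/quarterly/yearly work the same way
# with a different period expression.
mau = (daily
       .withColumn("month", F.date_format("event_date", "yyyy-MM"))
       .groupBy("month")
       .agg(F.countDistinct("user_id").alias("mau")))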

On 2019/6/5 at 6:49 PM, Rishi Shah wrote:
Hi All,

Is there a best practice around calculating daily, weekly, monthly, quarterly,
and yearly active users?

One approach is to create daily bitmaps and aggregate them by period later.
However, I was wondering if anyone has a better approach to tackling this
problem.

--
Regards,

Rishi Shah


Read hdfs files in spark streaming

2019-06-09 Thread Deepak Sharma
I am using a Spark Streaming application to read from Kafka.
The value of each Kafka message is the path to an HDFS file.
I am using Spark 2.x with spark.readStream.
What is the best way to read this path in Spark Streaming and then read the
JSON stored at the HDFS path, perhaps using spark.read.json, into a DataFrame
inside the streaming application?
Thanks a lot in advance

-- 
Thanks
Deepak