Spark Dataset transformations for time-based events

2018-12-25 Thread Debajyoti Roy
Hope everyone is enjoying their holidays.

If anyone here has run into these time-based event transformation patterns or
has a strong opinion about the approach, please let me know or reply on SO
(a minimal sketch of one possible approach follows after the list):

   1. Enrich using as-of time:
      https://stackoverflow.com/questions/53928880/how-to-do-a-time-based-as-of-join-of-two-datasets-in-apache-spark
   2. Snapshot of state with time to state with effective start and end time:
      https://stackoverflow.com/questions/53928372/given-dataset-of-state-snapshots-at-time-t-how-to-transform-it-into-dataset-with/53928400#53928400
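
For concreteness, here is a minimal sketch of one way to express both patterns
with Dataset/DataFrame window functions. The column names, literal timestamps,
and the join-then-filter strategy for the as-of join are my assumptions for
illustration, not necessarily what the linked answers recommend, and a plain
join plus window can be expensive at scale:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object TimeBasedTransforms {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("time-based-transforms").master("local[*]").getOrCreate()
    import spark.implicits._

    // 1. As-of enrichment: for each event, pick the latest dimension row whose
    //    valid_from is not after the event timestamp. (Timestamps are kept as
    //    sortable strings here only to keep the sketch short.)
    val events = Seq((1L, "2018-12-25 10:00"), (1L, "2018-12-25 12:00")).toDF("id", "event_ts")
    val dims = Seq((1L, "2018-12-25 09:00", "v1"), (1L, "2018-12-25 11:00", "v2"))
      .toDF("id", "valid_from", "attr")

    val perEvent = Window.partitionBy($"id", $"event_ts").orderBy($"valid_from".desc)
    val enriched = events.join(dims, Seq("id"))
      .where($"valid_from" <= $"event_ts")          // keep only rows known as of the event
      .withColumn("rn", row_number().over(perEvent))
      .where($"rn" === 1)                           // latest such row per event
      .drop("rn")
    enriched.show(false)

    // 2. Snapshots to effective intervals: each snapshot is effective from its
    //    own timestamp until the next snapshot for the same key (null for the last one).
    val snapshots = Seq((1L, "2018-12-25 09:00", "v1"), (1L, "2018-12-25 11:00", "v2"))
      .toDF("id", "snapshot_ts", "state")
    val perKey = Window.partitionBy($"id").orderBy($"snapshot_ts")
    val intervals = snapshots
      .withColumn("effective_start", $"snapshot_ts")
      .withColumn("effective_end", lead($"snapshot_ts", 1).over(perKey))
    intervals.show(false)

    spark.stop()
  }
}
```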


Thanks in advance!
Roy


Re: How to clean up logs-dirs and local-dirs of a running spark streaming application in yarn cluster mode

2018-12-25 Thread Fawze Abujaber
http://shzhangji.com/blog/2015/05/31/spark-streaming-logging-configuration/
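
For quick reference, a hedged sketch of the kind of rolling-log settings that
come up for long-running streaming jobs. The spark.yarn.rolledLog.* and
spark.executor.logs.rolling.* keys are documented Spark settings, but the
values here are illustrative, rolling log aggregation also needs
yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds enabled on
the NodeManagers, and the linked post's approach of shipping a custom log4j
rolling-appender config may still be what you want:

```scala
import org.apache.spark.SparkConf

object RollingLogConfSketch {
  // Illustrative values only; check which of these apply to your deployment mode.
  val conf: SparkConf = new SparkConf()
    // Let YARN aggregate matching executor log files in a rolling fashion while
    // the streaming app is still running, so they can leave the local log-dirs.
    .set("spark.yarn.rolledLog.includePattern", "spark.*\\.log.*")
    // Bound the executor logs themselves by rolling them over time.
    .set("spark.executor.logs.rolling.strategy", "time")
    .set("spark.executor.logs.rolling.time.interval", "daily")
    .set("spark.executor.logs.rolling.maxRetainedFiles", "7")
}
```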

On Wed, Dec 26, 2018 at 1:05 AM shyla deshpande 
wrote:

> Please point me to any documentation if available. Thanks
>
> On Tue, Dec 18, 2018 at 11:10 AM shyla deshpande 
> wrote:
>
>> Is there a way to do this without stopping the streaming application in
>> yarn cluster mode?
>>
>> On Mon, Dec 17, 2018 at 4:42 PM shyla deshpande 
>> wrote:
>>
>>> I get the ERROR
>>> 1/1 local-dirs are bad: /mnt/yarn; 1/1 log-dirs are bad:
>>> /var/log/hadoop-yarn/containers
>>>
>>> Is there a way to clean up these directories while the spark streaming
>>> application is running?
>>>
>>> Thanks
>>>
>>

-- 
Take Care
Fawze Abujaber


Re: Tuning G1GC params for aggressive garbage collection?

2018-12-25 Thread Akshay Mendole
Hi,
 Yes, I did try increasing it to the number of cores. It did not
work as expected. I know System.gc() is not a proper way. But since this is
a batch application, I'm okay if it spends more time doing GC and takes
considerably more CPU. Frequent System.gc() calls with a lower Xmx (6 GB) gave
us more throughput compared to no manual GC calls with a higher Xmx (8
GB). If I don't do manual GC calls when Xmx is 6 GB, the executor goes
OOM. I was just wondering what the best tuning parameters would be to
give a similar effect.
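
For reference, a minimal sketch of the periodic System.gc() daemon described
later in this thread (the object name and the idea of calling startOnce() from
task code, e.g. at the top of a mapPartitions, are illustrative assumptions,
not the exact code we run):

```scala
import java.util.concurrent.{Executors, ThreadFactory, TimeUnit}

// Daemon thread in each executor JVM that forces a full GC once a minute.
object PeriodicGcDaemon {
  @volatile private var started = false

  def startOnce(intervalSeconds: Long = 60L): Unit = synchronized {
    if (!started) {
      val factory = new ThreadFactory {
        override def newThread(r: Runnable): Thread = {
          val t = new Thread(r, "periodic-gc")
          t.setDaemon(true) // never keep the executor JVM alive because of this thread
          t
        }
      }
      Executors.newSingleThreadScheduledExecutor(factory).scheduleAtFixedRate(
        new Runnable { override def run(): Unit = System.gc() },
        intervalSeconds, intervalSeconds, TimeUnit.SECONDS)
      started = true
    }
  }
}
```

Any G1 flags being tuned instead (e.g. -XX:InitiatingHeapOccupancyPercent,
-XX:ConcGCThreads) would normally be passed to the executors via
spark.executor.extraJavaOptions.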
Thanks,
Akshay

On Tue, 25 Dec 2018, 11:28 pm Ramandeep Singh Nanda wrote:

> Hi,
>
> Did you try increasing ConcGCThreads for the marking?
>
> System.gc() is not a good way to handle this, as it is not guaranteed and is
> a high-pause, full GC.
>
> Regards,
> Ramandeep Singh
>
> On Tue, Dec 25, 2018, 07:09 Akshay Mendole wrote:
>
>> Hi,
>>
>> My JVM is basically a spark executor which is running tasks one after
>> another. A task is memory hungry and requires significant memory during its
>> lifecycle.
>>
>> [image: Screen Shot 2018-12-25 at 3.48.51 PM.png]
>> The above JVM is running on G1GC with default params. As you can see in
>> the VisualVM report on the right-hand side between 4:25 and 4:32 PM, the
>> spikes are due to the tasks run by the executor (essentially, each spike
>> is due to the executor picking up a new task after the previous one is
>> finished). When I triggered a manual GC at 4:35, I saw a sharp decline in the
>> heap usage. Also, as you can see on the left-hand side in the JConsole
>> report, the old gen space is never collected by G1GC (the sharp decline in
>> old gen space just before 16:35 is due to the manual GC).
>>
>> As my application is a spark batch job application, I am OK if the JVMs
>> spend a good amount of time doing GC. But I am running a bit short on memory.
>> So, I wanted to know how I can tune my JVM G1GC params so that there are
>> more frequent GCs (with the old gen space also getting collected) and I can
>> get the work done with considerably less heap space (Xmx).
>>
>> I tried changing -XX:InitiatingHeapOccupancyPercent. I tried with 0, 5, and
>> 10. The lower the value, the more frequent the GC and the more CPU it would
>> consume, but the behaviour would not be consistent. After some time, the
>> heap usage would spike up, and if we set the Xmx value lower (6 GB)
>> than what's in the picture above, it would throw OOM.
>>
>> *Something that is really working for us is attaching a daemon thread
>> that calls System.gc() at an interval of 1 minute in each executor.*
>> But I wanted to know how we can gracefully tune our GC so that it
>> gives the same effect as the manual System.gc() calls.
>>
>> Thanks,
>> Akshay
>>
>>


Re: How to clean up logs-dirs and local-dirs of a running spark streaming application in yarn cluster mode

2018-12-25 Thread shyla deshpande
Please point me to any documentation if available. Thanks

On Tue, Dec 18, 2018 at 11:10 AM shyla deshpande 
wrote:

> Is there a way to do this without stopping the streaming application in
> yarn cluster mode?
>
> On Mon, Dec 17, 2018 at 4:42 PM shyla deshpande 
> wrote:
>
>> I get the ERROR
>> 1/1 local-dirs are bad: /mnt/yarn; 1/1 log-dirs are bad:
>> /var/log/hadoop-yarn/containers
>>
>> Is there a way to clean up these directories while the spark streaming
>> application is running?
>>
>> Thanks
>>
>


Re: Packaging kafka certificates in uber jar

2018-12-25 Thread Anastasios Zouzias
Hi Colin,

You can place your certificates under src/main/resources and include them
in the uber JAR; see e.g.:
https://stackoverflow.com/questions/40252652/access-files-in-resources-directory-in-jar-from-apache-spark-streaming-context
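
One thing worth spelling out: a resource packed inside an uber JAR is not a
plain file on disk, so building a java.io.File from the classloader resource
path fails with FileNotFound. A hedged sketch of the usual workaround, copying
the JKS from the classpath to a temp file and handing that path to the Kafka
client (the resource name and the commented option wiring are assumptions for
illustration):

```scala
import java.io.File
import java.nio.file.{Files, StandardCopyOption}

object TruststoreFromClasspath {
  /** Copies a classpath resource (e.g. a JKS packed under src/main/resources)
    * to a local temp file and returns its absolute path. */
  def materialize(resource: String): String = {
    val in = getClass.getResourceAsStream(s"/$resource")
    require(in != null, s"resource $resource not found on classpath")
    val tmp = File.createTempFile("truststore", ".jks")
    tmp.deleteOnExit()
    Files.copy(in, tmp.toPath, StandardCopyOption.REPLACE_EXISTING)
    in.close()
    tmp.getAbsolutePath
  }
}

// Hypothetical wiring for the Spark Kafka source (ssl.* are standard Kafka
// client settings, prefixed with "kafka." for the Spark source):
// .option("kafka.ssl.truststore.location", TruststoreFromClasspath.materialize("kafka.client.truststore.jks"))
// .option("kafka.ssl.truststore.password", sys.env("TRUSTSTORE_PASSWORD"))
```

Note the materialized path has to exist on whichever JVM actually opens the
Kafka connection; shipping the JKS with --files and referring to it by a
relative path is a common alternative.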

Best,
Anastasios

On Mon, Dec 24, 2018 at 10:29 PM Colin Williams <
colin.williams.seat...@gmail.com> wrote:

> I've been trying to read from kafka via a spark streaming client. I
> found out the spark cluster doesn't have certificates deployed. Then I
> tried using the same local certificates I've been testing with by
> packing them in an uber jar and getting a File handle from the
> Classloader resource. But I'm getting a FileNotFound exception.
> These are jks certificates. Is anybody aware of how to package
> certificates in a jar with a kafka client, preferably the spark one?
>
>
>

-- 
-- Anastasios Zouzias



Re: spark application takes significant time to succeed even after all jobs are completed

2018-12-25 Thread Jörn Franke
It could be that Spark checks each file after the job, and with many files
on HDFS this can take some time. I think this is also format specific (e.g. for
parquet it does some checks) and does not occur with all formats. This time is
not really highlighted in the UI (maybe worth raising an enhancement issue).

It could also be that you have stragglers (skewed partitions) somewhere, but
I assume you checked that already.

The only thing you can do is to have fewer files (for the final output but
also in between) or live with it. There are some other tuning methods as well
(a different output committer, etc.), but those would require more in-depth
knowledge of your application.
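
To make those two knobs concrete, here is a hedged sketch, with illustrative
paths and partition counts, of reducing the number of output files and
switching to version 2 of the Hadoop FileOutputCommitter (which moves task
output directly into place instead of doing a serial rename pass at job commit):

```scala
import org.apache.spark.sql.SparkSession

object FewerOutputFiles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("fewer-output-files")
      // v2 skips the serial job-commit rename pass of the default committer.
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      .getOrCreate()

    val df = spark.read.parquet("hdfs:///input/path")   // assumed input path
    df.coalesce(200)                                     // cap the number of output files
      .write
      .parquet("hdfs:///output/path")                    // assumed output path

    spark.stop()
  }
}
```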

> On 25.12.2018 at 13:08, Akshay Mendole wrote:
> 
> Yes. We have a lot of small files (10K files of around 100 MB each) that we 
> read from and write to HDFS. But the timeline shows the jobs have completed 
> quite some time ago, and the output directory is also updated at that time.
> Thanks,
> Akshay
> 
> 
>> On Tue, Dec 25, 2018 at 5:30 PM Jörn Franke  wrote:
>> Do you have a lot of small files? Do you use S3 or similar? It could be that 
>> Spark does some IO related tasks.
>> 
>> > On 25.12.2018 at 12:51, Akshay Mendole wrote:
>> > 
>> > Hi, 
>> >   As you can see in the picture below, the application's last job 
>> > finished at around 13:45 and I could see the output directory updated with 
>> > the results. Yet, the application took a total of 20 more minutes to change 
>> > its status. What could be the reason for this? Is this expected behaviour? The 
>> > application has 3 jobs with many stages inside, each having around 10K 
>> > tasks. Could the scale be the reason for this? What exactly is the spark 
>> > framework doing during this time?
>> > 
>> > 
>> > 
>> > Thanks,
>> > Akshay
>> > 


Re: spark application takes significant time to succeed even after all jobs are completed

2018-12-25 Thread Jörn Franke
Do you have a lot of small files? Do you use S3 or similar? It could be that 
Spark does some IO related tasks.

> On 25.12.2018 at 12:51, Akshay Mendole wrote:
> 
> Hi, 
>   As you can see in the picture below, the application's last job finished 
> at around 13:45 and I could see the output directory updated with the 
> results. Yet, the application took a total of 20 more minutes to change its 
> status. What could be the reason for this? Is this expected behaviour? The 
> application has 3 jobs with many stages inside, each having around 10K tasks. 
> Could the scale be the reason for this? What exactly is the spark framework doing 
> during this time?
> 
> 
> 
> Thanks,
> Akshay
> 




Re: spark application takes significant time to succeed even after all jobs are completed

2018-12-25 Thread Akshay Mendole
Yes. We have a lot of small files (10K files of around 100 MB each) that we
read from and write to HDFS. But the timeline shows the jobs have completed
quite some time ago, and the output directory is also updated at that time.
Thanks,
Akshay


On Tue, Dec 25, 2018 at 5:30 PM Jörn Franke  wrote:

> Do you have a lot of small files? Do you use S3 or similar? It could be
> that Spark does some IO related tasks.
>
> > On 25.12.2018 at 12:51, Akshay Mendole wrote:
> >
> > Hi,
> >   As you can see in the picture below, the application's last job
> > finished at around 13:45 and I could see the output directory updated with
> > the results. Yet, the application took a total of 20 more minutes to change
> > its status. What could be the reason for this? Is this expected behaviour? The
> > application has 3 jobs with many stages inside, each having around 10K
> > tasks. Could the scale be the reason for this? What exactly is the spark
> > framework doing during this time?
> >
> > 
> >
> > Thanks,
> > Akshay
> >
>


spark application takes significant time to succeed even after all jobs are completed

2018-12-25 Thread Akshay Mendole
Hi,
  As you can see in the picture below, the application's last job
finished at around 13:45 and I could see the output directory updated with
the results. Yet, the application took a total of 20 more minutes to change
its status. What could be the reason for this? Is this expected behaviour? The
application has 3 jobs with many stages inside, each having around 10K
tasks. Could the scale be the reason for this? What exactly is the spark
framework doing during this time?

[image: Screen Shot 2018-12-25 at 5.14.26 PM.png]

Thanks,
Akshay