Re: Spark read csv option - capture exception in a column in permissive mode

2019-06-16 Thread Ajay Thompson
There's a column which captures the corrupted record (by default,
_corrupt_record). However, the exception that caused the record to be
marked corrupt isn't captured. If the exception were captured in another
column, it would be very useful.
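
For reference, a minimal PySpark sketch of the current behaviour (the file
path and schema below are made up for illustration): in PERMISSIVE mode the
raw malformed line lands in the corrupt-record column, but the parse
exception itself is discarded.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()

# The corrupt-record column must be declared in the schema to be populated.
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("_corrupt_record", StringType(), True),  # default column name
])

df = (spark.read
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .schema(schema)
      .csv("/tmp/input.csv"))  # hypothetical path

# Spark disallows queries that reference only the corrupt column on the raw
# relation, so cache the parsed result before inspecting it.
df.cache()

# Malformed rows carry the raw line in _corrupt_record; there is no column
# with the reason/exception, which is what this thread is asking for.
df.filter(df["_corrupt_record"].isNotNull()).show(truncate=False)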

On Mon, 17 Jun 2019, 10:56 AM Gourav Sengupta wrote:

> Hi,
>
> It already does, I think; you just have to add the corrupt-record column
> to the schema that you are using to read.
>
> Regards,
> Gourav
>
> On Sun, Jun 16, 2019 at 2:48 PM  wrote:
>
>> Hi Team,
>>
>>
>>
>> Can we have another column which gives the reason the record was marked
>> corrupt, in permissive mode, while reading CSV?
>>
>>
>>
>> Thanks,
>>
>> Ajay
>>
>


Re: Creating Spark buckets that Presto / Athena / Hive can leverage

2019-06-16 Thread Gourav Sengupta
Hi Daniel,

Not quite sure of this, but does the Glue Data Catalog support bucketing yet?
You might want to find that out first.


Regards,
Gourav

On Sat, Jun 15, 2019 at 1:30 PM Daniel Mateus Pires wrote:

> Hi there!
>
> I am trying to optimize joins on data created by Spark, so I'd like to
> bucket the data to avoid shuffling.
>
> I am writing to immutable partitions every day, by writing data to a local
> HDFS and then copying it to S3. Is there a combination of bucketBy options
> and DDL that I can use so that Presto/Athena JOINs leverage the special
> layout of the data?
>
> e.g.
> CREATE EXTERNAL TABLE ...(on Presto/Athena)
> df.write.bucketBy(...).partitionBy(...). (in Spark)
> then copy this data to S3 with s3-dist-cp
> then MSCK REPAIR TABLE (on Presto/Athena)
>
> Daniel
>
>
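
For reference, a minimal sketch (not from the thread) of the Spark-side
write being described; the table name, columns, paths, and bucket count are
illustrative assumptions. Note that bucketBy only works with saveAsTable,
and, as far as I know, Spark's bucket hashing is not Hive-compatible, so
Presto/Athena may not actually exploit the bucket layout even when the DDL
declares it. Worth verifying before building on it.

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.table("staging_events")  # hypothetical source of the day's data

# Write bucketed, partitioned Parquet to the local HDFS staging location.
(df.write
   .partitionBy("ds")            # hypothetical daily partition column
   .bucketBy(64, "user_id")      # hypothetical join key
   .sortBy("user_id")
   .format("parquet")
   .option("path", "hdfs:///warehouse/events")
   .saveAsTable("events"))

# After s3-dist-cp copies the files to S3, a matching external table could
# be declared on the Athena side, e.g. (illustrative DDL, to be verified):
#
#   CREATE EXTERNAL TABLE events (user_id bigint, ...)
#   PARTITIONED BY (ds string)
#   CLUSTERED BY (user_id) INTO 64 BUCKETS
#   STORED AS PARQUET
#   LOCATION 's3://my-bucket/events/';
#
#   MSCK REPAIR TABLE events;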


Re: Spark read csv option - capture exception in a column in permissive mode

2019-06-16 Thread Gourav Sengupta
Hi,

It already does, I think; you just have to add the corrupt-record column
to the schema that you are using to read.

Regards,
Gourav

On Sun, Jun 16, 2019 at 2:48 PM  wrote:

> Hi Team,
>
>
>
> Can we have another column which gives the reason the record was marked
> corrupt, in permissive mode, while reading CSV?
>
>
>
> Thanks,
>
> Ajay
>


Re: Exposing JIRA issue types at GitHub PRs

2019-06-16 Thread Hyukjin Kwon
Labels look good and useful.

On Sat, 15 Jun 2019, 02:36 Dongjoon Hyun wrote:

> Now you can see the exposed component labels (ordered by the number of
> PRs) at the link below; click a component to search.
>
> https://github.com/apache/spark/labels?sort=count-desc
>
> Dongjoon.
>
>
> On Fri, Jun 14, 2019 at 1:15 AM Dongjoon Hyun wrote:
>
>> Hi, All.
>>
>> The JIRA and PR are ready for review.
>>
>> https://issues.apache.org/jira/browse/SPARK-28051 (Exposing JIRA issue
>> component types at GitHub PRs)
>> https://github.com/apache/spark/pull/24871
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Thu, Jun 13, 2019 at 10:48 AM Dongjoon Hyun wrote:
>>
>>> Thank you for the feedback and requirements, Hyukjin, Reynold, Marco.
>>>
>>> Sure, we can do whatever we want.
>>>
>>> I'll wait for more feedback and proceed to the next steps.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Wed, Jun 12, 2019 at 11:51 PM Marco Gaido wrote:
>>>
 Hi Dongjoon,
 Thanks for the proposal! I like the idea. Maybe we can extend it to
 components too, and to some JIRA labels, such as correctness, which may be
 worth highlighting in PRs as well. My only concern is that in many cases
 JIRAs are created without much care, so they may be incorrect at the moment
 of PR creation and only updated later: keeping them in sync may be an extra
 effort.

 On Thu, 13 Jun 2019, 08:09 Reynold Xin wrote:

> Seems like a good idea. Can we test this with a component first?
>
> On Thu, Jun 13, 2019 at 6:17 AM Dongjoon Hyun wrote:
>
>> Hi, All.
>>
>> Since we use both Apache JIRA and GitHub actively for Apache Spark
>> contributions, we consequently have lots of JIRAs and PRs. One specific
>> thing I've been longing to see is `Jira Issue Type` in GitHub.
>>
>> How about exposing JIRA issue types at GitHub PRs as GitHub `Labels`?
>> There are two main benefits:
>> 1. It helps communication between contributors and reviewers by carrying
>> more information.
>> (In some cases, people only visit GitHub to see the PR and commits.)
>> 2. `Labels` are searchable, so we don't need to visit Apache JIRA to
>> find PRs of a specific type.
>> (For example, reviewers can see and review 'BUG' PRs first by using
>> `is:open is:pr label:BUG`.)
>>
>> Of course, this can be done automatically without human intervention.
>> Since we already have a Jenkins job with access to JIRA/GitHub, that job
>> can add the labels from the beginning. If needed, I can volunteer to
>> update the script.
>>
>> To demonstrate, I labeled several PRs manually. You can see the result
>> right now on the Apache Spark PR page.
>>
>>   - https://github.com/apache/spark/pulls
>>
>> If you were surprised by those manual activities, I want to apologize.
>> I hope we can take advantage of the existing GitHub features to serve the
>> Apache Spark community better than yesterday.
>>
>> What do you think about this specific suggestion?
>>
>> Bests,
>> Dongjoon
>>
>> PS. I saw that `Request Review` and `Assign` features are already used
>> for some purposes, but these features are out of scope in this email.
>>
>
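
For what it's worth, a rough sketch of how such a job could derive a label
from the JIRA issue type; this is only an illustration, not the existing
Jenkins script, and the PR-title parsing and token handling are simplified
assumptions.

import re
import requests

JIRA_API = "https://issues.apache.org/jira/rest/api/2/issue"
GITHUB_API = "https://api.github.com/repos/apache/spark/issues"

def jira_issue_type(issue_key):
    # Fetch the issue type name (e.g. 'Bug', 'Improvement') from Apache JIRA.
    resp = requests.get(f"{JIRA_API}/{issue_key}",
                        params={"fields": "issuetype"})
    resp.raise_for_status()
    return resp.json()["fields"]["issuetype"]["name"]

def label_pr(pr_number, pr_title, token):
    # Spark PR titles conventionally start with '[SPARK-XXXXX]'.
    match = re.match(r"\[(SPARK-\d+)\]", pr_title)
    if not match:
        return
    label = jira_issue_type(match.group(1)).upper()  # e.g. 'BUG'
    resp = requests.post(f"{GITHUB_API}/{pr_number}/labels",
                         headers={"Authorization": f"token {token}"},
                         json={"labels": [label]})
    resp.raise_for_status()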


Spark read csv option - capture exception in a column in permissive mode

2019-06-16 Thread ajay.thompson
Hi Team,

 

Can we have another column which gives the reason the record was marked
corrupt, in permissive mode, while reading CSV?

 

Thanks,

Ajay



Re: [Pyspark 2.3+] Timeseries with Spark

2019-06-16 Thread Rishi Shah
Thanks Jörn. I am interested in time-series forecasting for now, but in
general I was unable to find a good way to work with different time-series
methods using Spark.

On Fri, Jun 14, 2019 at 1:55 AM Jörn Franke wrote:

> Time series can mean a lot of different things and algorithms. Can you
> describe in more detail what you mean by a time series use case, i.e. what
> is the input, what would you like to do with it, and what is the output?
>
> > On 14.06.2019 at 06:01, Rishi Shah wrote:
> >
> > Hi All,
> >
> > I have a time series use case which I would like to implement in
> > Spark... What would be the best way to do so? Any built-in libraries?
> >
> > --
> > Regards,
> >
> > Rishi Shah
>


-- 
Regards,

Rishi Shah


Re: Spark 2.4.3 - Structured Streaming - high on Storage Memory

2019-06-16 Thread puneetloya
Just more info on the above post:

I have been seeing a lot of these logs:

1) The state for version 15109 (other numbers too) doesn't exist in
loadedMaps. Reading snapshot file and delta files if needed...Note that this
is normal for the first batch of starting query.

2) KafkaConsumer cache hitting max capacity of 64, removing consumer for
CacheKey
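
A side note on log (2): if I remember correctly, the consumer cache behind
that message is sized by an undocumented SQL conf in Spark 2.4; the property
name below is an assumption, so verify it against your Spark version before
relying on it.

from pyspark.sql import SparkSession

# Assumed internal property; its default of 64 matches the log message above.
spark = (SparkSession.builder
         .config("spark.sql.kafkaConsumerCache.capacity", "128")
         .getOrCreate())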


