[jira] [Resolved] (SPARK-20708) Make `addExclusionRules` up-to-date

2017-05-31 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-20708.
-
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.3.0

> Make `addExclusionRules` up-to-date
> ---
>
> Key: SPARK-20708
> URL: https://issues.apache.org/jira/browse/SPARK-20708
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2, 2.1.1
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
> Fix For: 2.3.0
>
>
> Since SPARK-9263, `resolveMavenCoordinates` ignores Spark and Spark's 
> dependencies by using `addExclusionRules`. This issue aims to bring 
> `addExclusionRules` up-to-date so that exclusion works correctly, because it 
> currently fails to exclude some components, as shown below.
> *mllib (correct)*
> {code}
> $ bin/spark-shell --packages org.apache.spark:spark-mllib_2.11:2.1.1
> ...
> -------------------------------------------------------------
> |          |            modules            ||   artifacts   |
> |   conf   | number| search|dwnlded|evicted|| number|dwnlded|
> -------------------------------------------------------------
> |  default |   0   |   0   |   0   |   0   ||   0   |   0   |
> -------------------------------------------------------------
> {code}
> *mllib-local (wrong)*
> {code}
> $ bin/spark-shell --packages org.apache.spark:spark-mllib-local_2.11:2.1.1
> ...
> -------------------------------------------------------------
> |          |            modules            ||   artifacts   |
> |   conf   | number| search|dwnlded|evicted|| number|dwnlded|
> -------------------------------------------------------------
> |  default |   15  |   2   |   2   |   0   ||   15  |   2   |
> -------------------------------------------------------------
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20708) Make `addExclusionRules` up-to-date

2017-05-31 Thread Burak Yavuz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16032475#comment-16032475
 ] 

Burak Yavuz commented on SPARK-20708:
-

Resolved by https://github.com/apache/spark/pull/17947

> Make `addExclusionRules` up-to-date
> ---
>
> Key: SPARK-20708
> URL: https://issues.apache.org/jira/browse/SPARK-20708
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2, 2.1.1
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
> Fix For: 2.3.0
>
>
> Since SPARK-9263, `resolveMavenCoordinates` ignores Spark and Spark's 
> dependencies by using `addExclusionRules`. This issue aims to bring 
> `addExclusionRules` up-to-date so that exclusion works correctly, because it 
> currently fails to exclude some components, as shown below.
> *mllib (correct)*
> {code}
> $ bin/spark-shell --packages org.apache.spark:spark-mllib_2.11:2.1.1
> ...
> -------------------------------------------------------------
> |          |            modules            ||   artifacts   |
> |   conf   | number| search|dwnlded|evicted|| number|dwnlded|
> -------------------------------------------------------------
> |  default |   0   |   0   |   0   |   0   ||   0   |   0   |
> -------------------------------------------------------------
> {code}
> *mllib-local (wrong)*
> {code}
> $ bin/spark-shell --packages org.apache.spark:spark-mllib-local_2.11:2.1.1
> ...
> -------------------------------------------------------------
> |          |            modules            ||   artifacts   |
> |   conf   | number| search|dwnlded|evicted|| number|dwnlded|
> -------------------------------------------------------------
> |  default |   15  |   2   |   2   |   0   ||   15  |   2   |
> -------------------------------------------------------------
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Structured Streaming from Parquet

2017-05-25 Thread Burak Yavuz
Hi Paul,

From what you're describing, it seems that stream1 is possibly generating
tons of small files and stream2 is OOMing because it tries to maintain an
in-memory list of files. Some notes/questions:

 1. Parquet files are splittable, therefore having large parquet files
shouldn't be a problem. The larger a parquet file is, the longer the write
process will take, but the read path shouldn't be adversely affected.
 2. How many partitions are you writing out to?
 3. In order to reduce the number of files, you may call
`repartition(partitionColumns).writeStream.partitionBy(partitionColumns)`
so that every trigger, you output only 1 file per partition (a short sketch
follows below). After some time, you may want to compact files if you don't
partition by date.
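
Here is a minimal sketch of that call chain. The schema, column name, paths,
and input DataFrame are placeholders for illustration, not your actual job:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().appName("one-file-per-partition").getOrCreate()

// Placeholder streaming input; in practice this is whatever streaming DataFrame you write out.
val rawSchema = StructType(Seq(StructField("expireDate", StringType), StructField("value", StringType)))
val streamingDF = spark.readStream.schema(rawSchema).parquet("/path/to/input")  // hypothetical path

val partitionCol = "expireDate"
streamingDF
  .repartition(col(partitionCol))    // rows with the same expireDate land in the same task...
  .writeStream
  .partitionBy(partitionCol)         // ...so each trigger writes roughly one file per partition directory
  .format("parquet")
  .option("checkpointLocation", "/path/to/checkpoint")  // hypothetical
  .option("path", "/path/to/output")                    // hypothetical
  .start()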

Best,
Burak



On Thu, May 25, 2017 at 7:13 AM, Paul Corley 
wrote:

> I have a Spark Structured Streaming process that is implemented in 2
> separate streaming apps.
>
>
>
> The first app reads .gz files (ranging in size from 1GB to 9GB compressed) in
> from s3, filters out invalid records, repartitions the data, and outputs
> parquet to s3 partitioned the same way the stream is partitioned. This process
> produces thousands of files which other processes consume. The thought on this
> approach was to:
>
> 1)   Break the files down to smaller, more easily consumed sizes
>
> 2)   Allow more parallelism in the processes that consume the data.
>
> 3)   Allow multiple downstream processes to consume data that
>
> a.   has already had bad records filtered out
>
> b.   does not have to be fully read in as one large file
>
>
>
> Second application reads in the files produced by the first app.  This
> process then reformats the data from a row that is:
>
>
>
> 12NDSIN|20170101:123313, 5467;20170115:987
>
>
>
> into:
>
> 12NDSIN, 20170101, 123313
>
> 12NDSIN, 20170101, 5467
>
> 12NDSIN, 20170115, 987
>
>
>
> App 1 runs no problems and churns through files in its source directory on
> s3.  Total process time for a file is < 10min.  App2 is the one having
> issues.
>
>
>
> The source is defined as
>
> val rawReader = sparkSession
>   .readStream
>   .option("latestFirst", "true")
>   .option("maxFilesPerTrigger", batchSize)
>   .schema(rawSchema)
>   .parquet(config.getString("aws.s3.sourcepath"))   // <=== Line 85
>
>
>
> output is defined as
>
> val query = output
>   .writeStream
>   .queryName("bk")
>   .format("parquet")
>   .partitionBy("expireDate")
>   .trigger(ProcessingTime("10 seconds"))
>   .option("checkpointLocation", config.getString("spark.app.checkpoint_dir") + "/bk")
>   .option("path", config.getString("spark.app.s3.output"))
>   .start()
>   .awaitTermination()
>
>
>
> If files exist from app 1, app 2 enters a cycle of just cycling through the
> `parquet at ProcessFromSource.scala:85` stage, showing e.g. 3999/3999 tasks
> in the Spark UI.
>
>
>
> If there are only a few files output from app1, it will eventually enter the
> stage where it actually processes the data and begins to output, but the more
> files produced by app1 the longer this takes, if it ever completes these
> steps.  With an extremely large number of files the app eventually throws a
> Java OOM error. Additionally, each cycle through this step takes successively
> longer.
>
> Hopefully someone can lend some insight as to what is actually taking place
> in this step and how to alleviate it.
>
>
>
>
>
>
>
> Thanks,
>
>
>
> *Paul Corley* | Principal Data Engineer
>
>


[phpMyAdmin Git] [phpmyadmin/localized_docs] 3a556c: Translated using Weblate (Turkish)

2017-05-24 Thread Burak Yavuz
  Branch: refs/heads/master
  Home:   https://github.com/phpmyadmin/localized_docs
  Commit: 3a556caa9429ae024a9ef84fb7abf147adf146f3
  
https://github.com/phpmyadmin/localized_docs/commit/3a556caa9429ae024a9ef84fb7abf147adf146f3
  Author: Burak Yavuz <hitowerdi...@hotmail.com>
  Date:   2017-05-24 (Wed, 24 May 2017)

  Changed paths:
M po/tr.mo
M po/tr.po

  Log Message:
  ---
  Translated using Weblate (Turkish)

Currently translated at 100.0% (2354 of 2354 strings)

[CI skip]


___
Git mailing list
Git@phpmyadmin.net
https://lists.phpmyadmin.net/mailman/listinfo/git


Re: couple naive questions on Spark Structured Streaming

2017-05-22 Thread Burak Yavuz
Hi Kant,

>
>
> 1. Can we use Spark Structured Streaming for stateless transformations
> just like we would do with DStreams, or is Spark Structured Streaming only
> meant for stateful computations?
>

Of course you can do stateless transformations. Any map, filter, select
type of transformation is stateless. Aggregations are generally stateful.
You could also perform arbitrary stateless aggregations with "flatMapGroups"
or make them stateful with "flatMapGroupsWithState".



> 2. When we use groupBy and Window operations for event time processing and
> specify a watermark, does this mean the timestamp field in each message is
> compared to the processing time of that machine/node, and events that are
> later than the specified threshold are discarded? If we don't specify a
> watermark I am assuming the processing time won't come into the picture. Is
> that right? Just trying to understand the interplay between processing time
> and event time when we do event time processing.
>
Watermarks are tracked with respect to the event time of your data, not
the processing time of the machine. Please take a look at the blog below
for more details (a small sketch follows as well):
https://databricks.com/blog/2017/05/08/event-time-aggregation-watermarking-apache-sparks-structured-streaming.html
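
A minimal, hedged sketch of event-time windowing with a watermark, using the
built-in rate source (assuming Spark 2.2+, where it emits `timestamp` and
`value` columns):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder().appName("watermark-sketch").master("local[2]").getOrCreate()
import spark.implicits._

val events = spark.readStream.format("rate").option("rowsPerSecond", "5").load()

// The watermark is computed from the event-time column (`timestamp`), not the machine clock:
// rows whose timestamp trails the maximum event time seen by more than 10 minutes are dropped,
// and window state older than the watermark can be cleaned up.
val counts = events
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", "5 minutes"))
  .count()

counts.writeStream.outputMode("update").format("console").start()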

Best,
Burak


[phpMyAdmin Git] [phpmyadmin/localized_docs] d74c4b: Translated using Weblate (Turkish)

2017-05-18 Thread Burak Yavuz
  Branch: refs/heads/master
  Home:   https://github.com/phpmyadmin/localized_docs
  Commit: d74c4b4691e26179f1de914d76b81d25c93ea214
  
https://github.com/phpmyadmin/localized_docs/commit/d74c4b4691e26179f1de914d76b81d25c93ea214
  Author: Burak Yavuz <hitowerdi...@hotmail.com>
  Date:   2017-05-18 (Thu, 18 May 2017)

  Changed paths:
M po/tr.mo
M po/tr.po

  Log Message:
  ---
  Translated using Weblate (Turkish)

Currently translated at 100.0% (2354 of 2354 strings)

[CI skip]


___
Git mailing list
Git@phpmyadmin.net
https://lists.phpmyadmin.net/mailman/listinfo/git


[phpMyAdmin Git] [phpmyadmin/phpmyadmin] 37b73a: Translated using Weblate (Turkish)

2017-05-17 Thread Burak Yavuz
  Branch: refs/heads/master
  Home:   https://github.com/phpmyadmin/phpmyadmin
  Commit: 37b73a2457778862b6539cdd8fa52511aa9132db
  
https://github.com/phpmyadmin/phpmyadmin/commit/37b73a2457778862b6539cdd8fa52511aa9132db
  Author: Burak Yavuz <hitowerdi...@hotmail.com>
  Date:   2017-05-17 (Wed, 17 May 2017)

  Changed paths:
M po/tr.po

  Log Message:
  ---
  Translated using Weblate (Turkish)

Currently translated at 100.0% (3207 of 3207 strings)

[CI skip]


___
Git mailing list
Git@phpmyadmin.net
https://lists.phpmyadmin.net/mailman/listinfo/git


[phpMyAdmin Git] [phpmyadmin/phpmyadmin] 445677: Translated using Weblate (Turkish)

2017-05-17 Thread Burak Yavuz
  Branch: refs/heads/QA_4_7
  Home:   https://github.com/phpmyadmin/phpmyadmin
  Commit: 445677ecd9de799ab4d0e3b695ccf6a72f1cfe3d
  
https://github.com/phpmyadmin/phpmyadmin/commit/445677ecd9de799ab4d0e3b695ccf6a72f1cfe3d
  Author: Burak Yavuz <hitowerdi...@hotmail.com>
  Date:   2017-05-17 (Wed, 17 May 2017)

  Changed paths:
M po/tr.po

  Log Message:
  ---
  Translated using Weblate (Turkish)

Currently translated at 100.0% (3198 of 3198 strings)

[CI skip]


___
Git mailing list
Git@phpmyadmin.net
https://lists.phpmyadmin.net/mailman/listinfo/git


[jira] [Resolved] (SPARK-20140) Remove hardcoded kinesis retry wait and max retries

2017-05-16 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-20140.
-
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.1

> Remove hardcoded kinesis retry wait and max retries
> ---
>
> Key: SPARK-20140
> URL: https://issues.apache.org/jira/browse/SPARK-20140
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Yash Sharma
>Assignee: Yash Sharma
>  Labels: kinesis, recovery
> Fix For: 2.2.1, 2.3.0
>
>
> The pull request proposes to remove the hardcoded values for Amazon Kinesis 
> - MIN_RETRY_WAIT_TIME_MS, MAX_RETRIES.
> This change is critical for Kinesis checkpoint recovery when the Kinesis 
> backed RDD is huge.
> The following happens in a typical Kinesis recovery:
> - Kinesis throttles a large number of requests while recovering
> - retries in case of throttling are not able to recover due to the small wait 
> period
> - Kinesis throttles per second, so the wait period should be configurable for 
> recovery
> The patch picks the spark kinesis configs from:
> - spark.streaming.kinesis.retry.wait.time
> - spark.streaming.kinesis.retry.max.attempts
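
For reference, a hedged sketch of wiring the two config keys named above into a
SparkConf; the keys come from the description, while the value formats shown are
assumptions for illustration:

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("kinesis-recovery")
  .set("spark.streaming.kinesis.retry.wait.time", "2000ms")  // wait between throttled retries (format assumed)
  .set("spark.streaming.kinesis.retry.max.attempts", "10")   // retries before giving up (format assumed)

val ssc = new StreamingContext(conf, Seconds(60))
{code}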



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20140) Remove hardcoded kinesis retry wait and max retries

2017-05-16 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz reassigned SPARK-20140:
---

Assignee: Yash Sharma

> Remove hardcoded kinesis retry wait and max retries
> ---
>
> Key: SPARK-20140
> URL: https://issues.apache.org/jira/browse/SPARK-20140
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Yash Sharma
>Assignee: Yash Sharma
>  Labels: kinesis, recovery
>
> The pull request proposes to remove the hardcoded values for Amazon Kinesis 
> - MIN_RETRY_WAIT_TIME_MS, MAX_RETRIES.
> This change is critical for Kinesis checkpoint recovery when the Kinesis 
> backed RDD is huge.
> The following happens in a typical Kinesis recovery:
> - Kinesis throttles a large number of requests while recovering
> - retries in case of throttling are not able to recover due to the small wait 
> period
> - Kinesis throttles per second, so the wait period should be configurable for 
> recovery
> The patch picks the spark kinesis configs from:
> - spark.streaming.kinesis.retry.wait.time
> - spark.streaming.kinesis.retry.max.attempts



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20140) Remove hardcoded kinesis retry wait and max retries

2017-05-16 Thread Burak Yavuz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16013191#comment-16013191
 ] 

Burak Yavuz commented on SPARK-20140:
-

resolved by https://github.com/apache/spark/pull/17467

> Remove hardcoded kinesis retry wait and max retries
> ---
>
> Key: SPARK-20140
> URL: https://issues.apache.org/jira/browse/SPARK-20140
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Yash Sharma
>Assignee: Yash Sharma
>  Labels: kinesis, recovery
>
> The pull request proposes to remove the hardcoded values for Amazon Kinesis 
> - MIN_RETRY_WAIT_TIME_MS, MAX_RETRIES.
> This change is critical for Kinesis checkpoint recovery when the Kinesis 
> backed RDD is huge.
> The following happens in a typical Kinesis recovery:
> - Kinesis throttles a large number of requests while recovering
> - retries in case of throttling are not able to recover due to the small wait 
> period
> - Kinesis throttles per second, so the wait period should be configurable for 
> recovery
> The patch picks the spark kinesis configs from:
> - spark.streaming.kinesis.retry.wait.time
> - spark.streaming.kinesis.retry.max.attempts



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20775) from_json should also have an API where the schema is specified with a string

2017-05-16 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-20775:
---

 Summary: from_json should also have an API where the schema is 
specified with a string
 Key: SPARK-20775
 URL: https://issues.apache.org/jira/browse/SPARK-20775
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Burak Yavuz


Right now you also have to provide a java.util.Map which is not nice for Scala 
users.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[phpMyAdmin Git] [phpmyadmin/localized_docs] 252c86: Translated using Weblate (Turkish)

2017-05-10 Thread Burak Yavuz
  Branch: refs/heads/master
  Home:   https://github.com/phpmyadmin/localized_docs
  Commit: 252c862dd57bb83b08a72a76939986154bb43350
  
https://github.com/phpmyadmin/localized_docs/commit/252c862dd57bb83b08a72a76939986154bb43350
  Author: Burak Yavuz <hitowerdi...@hotmail.com>
  Date:   2017-05-09 (Tue, 09 May 2017)

  Changed paths:
M po/tr.mo
M po/tr.po

  Log Message:
  ---
  Translated using Weblate (Turkish)

Currently translated at 100.0% (2362 of 2362 strings)

[CI skip]


___
Git mailing list
Git@phpmyadmin.net
https://lists.phpmyadmin.net/mailman/listinfo/git


Re: Why does dataset.union fails but dataset.rdd.union execute correctly?

2017-05-08 Thread Burak Yavuz
Yes, unfortunately. This should actually be fixed, and the union's schema
should use the less restrictive of the two DataFrames' schemas.
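
In the meantime, a minimal sketch (source path and column names are
hypothetical) for checking whether two Datasets differ only in nullability
before calling union:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("union-schema-check").master("local[2]").getOrCreate()
import spark.implicits._

val ds  = Seq((1, "a")).toDF("id", "name")   // the Int column comes out with nullable = false
val ds1 = spark.read.json("/path/to/json")   // hypothetical source; JSON-derived columns are typically nullable

ds.printSchema()
ds1.printSchema()

// Compare field names and types while ignoring nullability, to see if that is the only difference.
val sameIgnoringNullability =
  ds.schema.fields.map(f => (f.name, f.dataType)).sameElements(
    ds1.schema.fields.map(f => (f.name, f.dataType)))
println(s"Schemas match when nullability is ignored: $sameIgnoringNullability")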

On Mon, May 8, 2017 at 12:46 PM, Dirceu Semighini Filho <
dirceu.semigh...@gmail.com> wrote:

> HI Burak,
> By nullability you mean that if I have the exactly the same schema, but
> one side support null and the other doesn't, this exception (in union
> dataset) will be thrown?
>
>
>
> 2017-05-08 16:41 GMT-03:00 Burak Yavuz <brk...@gmail.com>:
>
>> I also want to add that generally these may be caused by the
>> `nullability` field in the schema.
>>
>> On Mon, May 8, 2017 at 12:25 PM, Shixiong(Ryan) Zhu <
>> shixi...@databricks.com> wrote:
>>
>>> This is because RDD.union doesn't check the schema, so you won't see the
>>> problem unless you run RDD and hit the incompatible column problem. For
>>> RDD, You may not see any error if you don't use the incompatible column.
>>>
>>> Dataset.union requires compatible schema. You can print ds.schema and
>>> ds1.schema and check if they are same.
>>>
>>> On Mon, May 8, 2017 at 11:07 AM, Dirceu Semighini Filho <
>>> dirceu.semigh...@gmail.com> wrote:
>>>
>>>> Hello,
>>>> I've a very complex case class structure, with a lot of fields.
>>>> When I try to union two datasets of this class, it doesn't work with
>>>> the following error :
>>>> ds.union(ds1)
>>>> Exception in thread "main" org.apache.spark.sql.AnalysisException:
>>>> Union can only be performed on tables with the compatible column types
>>>>
>>>> But when using its RDD, the union works fine:
>>>> ds.rdd.union(ds1.rdd)
>>>> res8: org.apache.spark.rdd.RDD[
>>>>
>>>> Is there any reason for this to happen (besides a bug ;) )
>>>>
>>>>
>>>>
>>>
>>
>


Re: Why does dataset.union fails but dataset.rdd.union execute correctly?

2017-05-08 Thread Burak Yavuz
I also want to add that generally these may be caused by the `nullability`
field in the schema.

On Mon, May 8, 2017 at 12:25 PM, Shixiong(Ryan) Zhu  wrote:

> This is because RDD.union doesn't check the schema, so you won't see the
> problem unless you run RDD and hit the incompatible column problem. For
> RDD, You may not see any error if you don't use the incompatible column.
>
> Dataset.union requires compatible schema. You can print ds.schema and
> ds1.schema and check if they are same.
>
> On Mon, May 8, 2017 at 11:07 AM, Dirceu Semighini Filho <
> dirceu.semigh...@gmail.com> wrote:
>
>> Hello,
>> I've a very complex case class structure, with a lot of fields.
>> When I try to union two datasets of this class, it doesn't work with the
>> following error :
>> ds.union(ds1)
>> Exception in thread "main" org.apache.spark.sql.AnalysisException: Union
>> can only be performed on tables with the compatible column types
>>
>> But when using its RDD, the union works fine:
>> ds.rdd.union(ds1.rdd)
>> res8: org.apache.spark.rdd.RDD[
>>
>> Is there any reason for this to happen (besides a bug ;) )
>>
>>
>>
>


[jira] [Commented] (SPARK-20571) Flaky SparkR StructuredStreaming tests

2017-05-05 Thread Burak Yavuz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998680#comment-15998680
 ] 

Burak Yavuz commented on SPARK-20571:
-

Thanks!

> Flaky SparkR StructuredStreaming tests
> --
>
> Key: SPARK-20571
> URL: https://issues.apache.org/jira/browse/SPARK-20571
> Project: Spark
>  Issue Type: Test
>  Components: SparkR, Structured Streaming
>Affects Versions: 2.2.0
>    Reporter: Burak Yavuz
>Assignee: Felix Cheung
> Fix For: 2.2.0, 2.3.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76399



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20441) Within the same streaming query, one StreamingRelation should only be transformed to one StreamingExecutionRelation

2017-05-03 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-20441.
-
Resolution: Fixed

Resolved with https://github.com/apache/spark/pull/17735

> Within the same streaming query, one StreamingRelation should only be 
> transformed to one StreamingExecutionRelation
> ---
>
> Key: SPARK-20441
> URL: https://issues.apache.org/jira/browse/SPARK-20441
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0, 2.1.1, 2.1.2, 2.2.0
>Reporter: Liwei Lin
>
> Within the same streaming query, when one StreamingRelation is referred 
> multiple times -- e.g. df.union(df) -- we should transform it only to one 
> StreamingExecutionRelation, instead of two or more different  
> StreamingExecutionRelations.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-20432) Unioning two identical Streaming DataFrames fails during attribute resolution

2017-05-03 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz closed SPARK-20432.
---
Resolution: Duplicate

> Unioning two identical Streaming DataFrames fails during attribute resolution
> -
>
> Key: SPARK-20432
> URL: https://issues.apache.org/jira/browse/SPARK-20432
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>    Reporter: Burak Yavuz
>
> To reproduce, try unioning two identical Kafka Streams:
> {code}
> df = spark.readStream.format("kafka")... \
>   .select(from_json(col("value").cast("string"), 
> simpleSchema).alias("parsed_value"))
> df.union(df).writeStream...
> {code}
> Exception is confusing:
> {code}
> org.apache.spark.sql.AnalysisException: resolved attribute(s) value#526 
> missing from 
> value#511,topic#512,partition#513,offset#514L,timestampType#516,key#510,timestamp#515
>  in operator !Project [jsontostructs(...) AS parsed_value#357];
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20571) Flaky SparkR StructuredStreaming tests

2017-05-02 Thread Burak Yavuz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15994185#comment-15994185
 ] 

Burak Yavuz commented on SPARK-20571:
-

cc [~felixcheung]

> Flaky SparkR StructuredStreaming tests
> --
>
> Key: SPARK-20571
> URL: https://issues.apache.org/jira/browse/SPARK-20571
> Project: Spark
>  Issue Type: Test
>  Components: SparkR, Structured Streaming
>Affects Versions: 2.2.0
>    Reporter: Burak Yavuz
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76399



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20571) Flaky SparkR StructuredStreaming tests

2017-05-02 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-20571:
---

 Summary: Flaky SparkR StructuredStreaming tests
 Key: SPARK-20571
 URL: https://issues.apache.org/jira/browse/SPARK-20571
 Project: Spark
  Issue Type: Test
  Components: SparkR, Structured Streaming
Affects Versions: 2.2.0
Reporter: Burak Yavuz


https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76399




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20549) java.io.CharConversionException: Invalid UTF-32 in JsonToStructs

2017-05-01 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-20549:
---

 Summary: java.io.CharConversionException: Invalid UTF-32 in 
JsonToStructs
 Key: SPARK-20549
 URL: https://issues.apache.org/jira/browse/SPARK-20549
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Burak Yavuz


The same fix for SPARK-16548 needs to be applied for JsonToStructs



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20496) KafkaWriter Uses Unanalyzed Logical Plan

2017-04-28 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-20496.
-
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.2

Resolved with https://github.com/apache/spark/pull/17804

> KafkaWriter Uses Unanalyzed Logical Plan
> 
>
> Key: SPARK-20496
> URL: https://issues.apache.org/jira/browse/SPARK-20496
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Bill Chambers
>Assignee: Bill Chambers
> Fix For: 2.1.2, 2.2.0
>
>
> Right now we use the unanalyzed logical plan for writing to Kafka; we should 
> use the analyzed plan.
> https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaWriter.scala#L50



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20496) KafkaWriter Uses Unanalyzed Logical Plan

2017-04-28 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz reassigned SPARK-20496:
---

Assignee: Bill Chambers

> KafkaWriter Uses Unanalyzed Logical Plan
> 
>
> Key: SPARK-20496
> URL: https://issues.apache.org/jira/browse/SPARK-20496
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Bill Chambers
>Assignee: Bill Chambers
>
> Right now we use the unanalyzed logical plan for writing to Kafka; we should 
> use the analyzed plan.
> https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaWriter.scala#L50



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20432) Unioning two identical Streaming DataFrames fails during attribute resolution

2017-04-21 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-20432:
---

 Summary: Unioning two identical Streaming DataFrames fails during 
attribute resolution
 Key: SPARK-20432
 URL: https://issues.apache.org/jira/browse/SPARK-20432
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.1.0
Reporter: Burak Yavuz


To reproduce, try unioning two identical Kafka Streams:

{code}
df = spark.readStream.format("kafka")... \
  .select(from_json(col("value").cast("string"), 
simpleSchema).alias("parsed_value"))

df.union(df).writeStream...
{code}

Exception is confusing:

{code}
org.apache.spark.sql.AnalysisException: resolved attribute(s) value#526 missing 
from 
value#511,topic#512,partition#513,offset#514L,timestampType#516,key#510,timestamp#515
 in operator !Project [jsontostructs(...) AS parsed_value#357];
{code}






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[phpMyAdmin Git] [phpmyadmin/phpmyadmin] 39fc41: Translated using Weblate (Turkish)

2017-04-12 Thread Burak Yavuz
  Branch: refs/heads/master
  Home:   https://github.com/phpmyadmin/phpmyadmin
  Commit: 39fc41d15c6b2bfd566a7944d1a762b98f443d19
  
https://github.com/phpmyadmin/phpmyadmin/commit/39fc41d15c6b2bfd566a7944d1a762b98f443d19
  Author: Burak Yavuz <hitowerdi...@hotmail.com>
  Date:   2017-04-12 (Wed, 12 Apr 2017)

  Changed paths:
M po/tr.po

  Log Message:
  ---
  Translated using Weblate (Turkish)

Currently translated at 100.0% (3198 of 3198 strings)

[CI skip]


___
Git mailing list
Git@phpmyadmin.net
https://lists.phpmyadmin.net/mailman/listinfo/git


[jira] [Created] (SPARK-20301) Flakiness in StreamingAggregationSuite

2017-04-11 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-20301:
---

 Summary: Flakiness in StreamingAggregationSuite
 Key: SPARK-20301
 URL: https://issues.apache.org/jira/browse/SPARK-20301
 Project: Spark
  Issue Type: Test
  Components: Structured Streaming
Affects Versions: 2.1.0
Reporter: Burak Yavuz
Assignee: Burak Yavuz


https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.streaming.StreamingAggregationSuite



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20301) Flakiness in StreamingAggregationSuite

2017-04-11 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz updated SPARK-20301:

Labels: flaky-test  (was: )

> Flakiness in StreamingAggregationSuite
> --
>
> Key: SPARK-20301
> URL: https://issues.apache.org/jira/browse/SPARK-20301
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>    Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>  Labels: flaky-test
>
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.streaming.StreamingAggregationSuite



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[phpMyAdmin Git] [phpmyadmin/localized_docs] b339ec: Translated using Weblate (Turkish)

2017-04-08 Thread Burak Yavuz
  Branch: refs/heads/master
  Home:   https://github.com/phpmyadmin/localized_docs
  Commit: b339ecdf2d4b08ba0946d8a58f23f805f8f031ef
  
https://github.com/phpmyadmin/localized_docs/commit/b339ecdf2d4b08ba0946d8a58f23f805f8f031ef
  Author: Burak Yavuz <hitowerdi...@hotmail.com>
  Date:   2017-04-08 (Sat, 08 Apr 2017)

  Changed paths:
M po/tr.mo
M po/tr.po

  Log Message:
  ---
  Translated using Weblate (Turkish)

Currently translated at 100.0% (2356 of 2356 strings)

[CI skip]


___
Git mailing list
Git@phpmyadmin.net
https://lists.phpmyadmin.net/mailman/listinfo/git


[jira] [Created] (SPARK-20230) FetchFailedExceptions should invalidate file caches in MapOutputTracker even if newer stages are launched

2017-04-05 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-20230:
---

 Summary: FetchFailedExceptions should invalidate file caches in 
MapOutputTracker even if newer stages are launched
 Key: SPARK-20230
 URL: https://issues.apache.org/jira/browse/SPARK-20230
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Burak Yavuz


If you lose instances that have shuffle outputs, you will start observing 
messages like:

{code}
17/03/24 11:49:23 WARN TaskSetManager: Lost task 0.0 in stage 64.1 (TID 3849, 
172.128.196.240, executor 0): FetchFailed(BlockManagerId(4, 172.128.200.157, 
4048, None), shuffleId=16, mapId=2, reduceId=3, message=
{code}

Generally, these messages are followed by:

{code}
17/03/24 11:49:23 INFO DAGScheduler: Executor lost: 4 (epoch 20)
17/03/24 11:49:23 INFO BlockManagerMasterEndpoint: Trying to remove executor 4 
from BlockManagerMaster.
17/03/24 11:49:23 INFO BlockManagerMaster: Removed 4 successfully in 
removeExecutor
17/03/24 11:49:23 INFO DAGScheduler: Shuffle files lost for executor: 4 (epoch 
20)
17/03/24 11:49:23 INFO ShuffleMapStage: ShuffleMapStage 63 is now unavailable 
on executor 4 (73/89, false)
{code}

which is great. Spark resubmits tasks for data that has been lost. However, if 
you have cascading instance failures, then you may come across:

{code}
17/03/24 11:48:39 INFO DAGScheduler: Ignoring fetch failure from ResultTask(64, 
46) as it's from ResultStage 64 attempt 0 and there is a more recent attempt 
for that stage (attempt ID 1) running
{code}

which doesn't invalidate the file outputs. In later retries of the stage, Spark 
will attempt to access files on machines that don't exist anymore, and after 4 
tries Spark will give up. If it had not ignored the fetch failure and had 
invalidated the cache, then most of the lost files could have been recomputed 
during one of the previous retries.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[phpMyAdmin Git] [phpmyadmin/localized_docs] 27e1d7: Translated using Weblate (Turkish)

2017-04-03 Thread Burak Yavuz
  Branch: refs/heads/master
  Home:   https://github.com/phpmyadmin/localized_docs
  Commit: 27e1d79fc1be067ef3ab402398eaba7b1fdc96f5
  
https://github.com/phpmyadmin/localized_docs/commit/27e1d79fc1be067ef3ab402398eaba7b1fdc96f5
  Author: Burak Yavuz <hitowerdi...@hotmail.com>
  Date:   2017-04-04 (Tue, 04 Apr 2017)

  Changed paths:
M po/tr.mo
M po/tr.po

  Log Message:
  ---
  Translated using Weblate (Turkish)

Currently translated at 100.0% (2354 of 2354 strings)

[CI skip]


___
Git mailing list
Git@phpmyadmin.net
https://lists.phpmyadmin.net/mailman/listinfo/git


[phpMyAdmin Git] [phpmyadmin/localized_docs] e00757: Translated using Weblate (Turkish)

2017-04-02 Thread Burak Yavuz
  Branch: refs/heads/master
  Home:   https://github.com/phpmyadmin/localized_docs
  Commit: e0075726ad73bae46849758407ed6c1aacddbe51
  
https://github.com/phpmyadmin/localized_docs/commit/e0075726ad73bae46849758407ed6c1aacddbe51
  Author: Burak Yavuz <hitowerdi...@hotmail.com>
  Date:   2017-04-02 (Sun, 02 Apr 2017)

  Changed paths:
M po/tr.mo
M po/tr.po

  Log Message:
  ---
  Translated using Weblate (Turkish)

Currently translated at 100.0% (2354 of 2354 strings)

[CI skip]


___
Git mailing list
Git@phpmyadmin.net
https://lists.phpmyadmin.net/mailman/listinfo/git


[phpMyAdmin Git] [phpmyadmin/localized_docs] 9e9bd8: Translated using Weblate (Turkish)

2017-03-31 Thread Burak Yavuz
  Branch: refs/heads/master
  Home:   https://github.com/phpmyadmin/localized_docs
  Commit: 9e9bd81d8d9a69034f5cbb5af9ee22955f3b7d3d
  
https://github.com/phpmyadmin/localized_docs/commit/9e9bd81d8d9a69034f5cbb5af9ee22955f3b7d3d
  Author: Burak Yavuz <hitowerdi...@hotmail.com>
  Date:   2017-03-31 (Fri, 31 Mar 2017)

  Changed paths:
M po/tr.mo
M po/tr.po

  Log Message:
  ---
  Translated using Weblate (Turkish)

Currently translated at 100.0% (2354 of 2354 strings)

[CI skip]


___
Git mailing list
Git@phpmyadmin.net
https://lists.phpmyadmin.net/mailman/listinfo/git


[jira] [Resolved] (SPARK-19911) Add builder interface for Kinesis DStreams

2017-03-24 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-19911.
-
  Resolution: Fixed
Assignee: Adam Budde
   Fix Version/s: 2.2.0
Target Version/s: 2.2.0

Resolved with https://github.com/apache/spark/pull/17250

> Add builder interface for Kinesis DStreams
> --
>
> Key: SPARK-19911
> URL: https://issues.apache.org/jira/browse/SPARK-19911
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Adam Budde
>Assignee: Adam Budde
>Priority: Minor
> Fix For: 2.2.0
>
>
> The ```KinesisUtils.createStream()``` interface for creating Kinesis-based 
> DStreams is quite brittle and requires adding a combinatorial number of 
> overrides whenever another optional configuration parameter is added. This 
> makes incorporating a lot of additional features supported by the Kinesis 
> Client Library such as per-service authorization unfeasible. This interface 
> should be replaced by a builder pattern class 
> (https://en.wikipedia.org/wiki/Builder_pattern) to allow for greater 
> extensibility.
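
For illustration only, a generic sketch of how a builder avoids the
combinatorial-overload problem described above. This is not the API that was
merged for this issue; all names below are made up:

{code}
final case class KinesisStreamConfig(
    streamName: String,
    endpointUrl: String = "https://kinesis.us-east-1.amazonaws.com",
    checkpointAppName: Option[String] = None,
    stsRoleArn: Option[String] = None)

class KinesisStreamBuilder private (config: KinesisStreamConfig) {
  // Each optional setting is a separate method, so adding a new option does not
  // require a new combinatorial overload of createStream().
  def endpointUrl(url: String): KinesisStreamBuilder =
    new KinesisStreamBuilder(config.copy(endpointUrl = url))
  def checkpointAppName(name: String): KinesisStreamBuilder =
    new KinesisStreamBuilder(config.copy(checkpointAppName = Some(name)))
  def stsRoleArn(arn: String): KinesisStreamBuilder =
    new KinesisStreamBuilder(config.copy(stsRoleArn = Some(arn)))
  def build(): KinesisStreamConfig = config
}

object KinesisStreamBuilder {
  def apply(streamName: String): KinesisStreamBuilder =
    new KinesisStreamBuilder(KinesisStreamConfig(streamName))
}

// Only the options that matter for a given job are specified:
val cfg = KinesisStreamBuilder("events")
  .checkpointAppName("my-app")
  .stsRoleArn("arn:aws:iam::123456789012:role/external-read")
  .build()
{code}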



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Spark 2.0.2 Dataset union() slowness vs RDD union?

2017-03-16 Thread Burak Yavuz
Hi Everett,

IIRC we added unionAll in Spark 2.0, which has the same implementation as RDD
union. The union in DataFrames with Spark 2.0 does deduplication, and
that's why you should be seeing the slowdown.

Best,
Burak

On Thu, Mar 16, 2017 at 4:14 PM, Everett Anderson 
wrote:

> Looks like the Dataset version of union may also fail with the following
> on larger data sets, which again seems like it might be drawing everything
> into the driver for some reason --
>
> 17/03/16 22:28:21 WARN TaskSetManager: Lost task 1.0 in stage 91.0 (TID 5760,
> ip-10-8-52-198.us-west-2.compute.internal): java.lang.IllegalArgumentException:
> bound must be positive
> at java.util.Random.nextInt(Random.java:388)
> at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.confChanged(LocalDirAllocator.java:305)
> at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:344)
> at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.createTmpFileForWrite(LocalDirAllocator.java:416)
> at org.apache.hadoop.fs.LocalDirAllocator.createTmpFileForWrite(LocalDirAllocator.java:198)
> at org.apache.hadoop.fs.s3a.S3AOutputStream.<init>(S3AOutputStream.java:87)
> at org.apache.hadoop.fs.s3a.S3AFileSystem.create(S3AFileSystem.java:410)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:892)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:789)
> at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:176)
> at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:160)
> at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:289)
> at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
> at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetFileFormat.scala:562)
> at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:139)
> at org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131)
> at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247)
> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> On Thu, Mar 16, 2017 at 2:55 PM, Everett Anderson 
> wrote:
>
>> Hi,
>>
>> We're using Dataset union() in Spark 2.0.2 to concatenate a bunch of
>> tables together and save as Parquet to S3, but it seems to take a long
>> time. We're using the S3A FileSystem implementation under the covers, too,
>> if that helps.
>>
>> Watching the Spark UI, the executors all eventually stop (we're using
>> dynamic allocation) but under the SQL tab we can see a "save at
>> NativeMethodAccessorImpl.java:-2" in Running Queries. The driver is
>> still running of course, but it may take tens of minutes to finish. It
>> makes me wonder if our data is all being collected through the driver.
>>
>> If we instead convert the Datasets to RDDs and call SparkContext.union()
>> it works quickly.
>>
>> Anyone know if this is a known issue?
>>
>>
>


[phpMyAdmin Git] [phpmyadmin/localized_docs] ea2f5f: Translated using Weblate (Turkish)

2017-03-13 Thread Burak Yavuz
  Branch: refs/heads/master
  Home:   https://github.com/phpmyadmin/localized_docs
  Commit: ea2f5f4500513e0856a4cb56f5954f46b01fb230
  
https://github.com/phpmyadmin/localized_docs/commit/ea2f5f4500513e0856a4cb56f5954f46b01fb230
  Author: Burak Yavuz <hitowerdi...@hotmail.com>
  Date:   2017-03-14 (Tue, 14 Mar 2017)

  Changed paths:
M po/tr.mo
M po/tr.po

  Log Message:
  ---
  Translated using Weblate (Turkish)

Currently translated at 100.0% (2349 of 2349 strings)

[CI skip]


___
Git mailing list
Git@phpmyadmin.net
https://lists.phpmyadmin.net/mailman/listinfo/git


[jira] [Created] (SPARK-19886) reportDataLoss cause != null check is wrong for Structured Streaming KafkaSource

2017-03-09 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-19886:
---

 Summary: reportDataLoss cause != null check is wrong for 
Structured Streaming KafkaSource
 Key: SPARK-19886
 URL: https://issues.apache.org/jira/browse/SPARK-19886
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.1.0
Reporter: Burak Yavuz
Assignee: Burak Yavuz






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19813) maxFilesPerTrigger combo latestFirst may miss old files in combination with maxFileAge in FileStreamSource

2017-03-08 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-19813.
-
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

> maxFilesPerTrigger combo latestFirst may miss old files in combination with 
> maxFileAge in FileStreamSource
> --
>
> Key: SPARK-19813
> URL: https://issues.apache.org/jira/browse/SPARK-19813
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 2.1.1, 2.2.0
>
>
> There is a file stream source option called maxFileAge which limits how old 
> the files can be, relative to the latest file that has been seen. This is used 
> to limit the files that need to be remembered as "processed". Files older 
> than the latest processed files are ignored. This value defaults to 7 days.
> This causes a problem when both 
>  - latestFirst = true
>  - maxFilesPerTrigger > total files to be processed.
> Here is what happens in all combinations:
>  1) latestFirst = false - Since files are processed in order, there won't be 
> any unprocessed file older than the latest processed file. All files will be 
> processed.
>  2) latestFirst = true AND maxFilesPerTrigger is not set - The maxFileAge 
> thresholding mechanism takes one batch to initialize. If maxFilesPerTrigger is 
> not set, then all old files get processed in the first batch, and so no file is 
> left behind.
>  3) latestFirst = true AND maxFilesPerTrigger is set to X - The first batch 
> processes the latest X files. That sets the threshold to latest file - maxFileAge, 
> so files older than this threshold will never be considered for processing. 
> The bug is with case 3.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19304) Kinesis checkpoint recovery is 10x slow

2017-03-06 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-19304.
-
  Resolution: Fixed
   Fix Version/s: 2.2.0
Target Version/s: 2.2.0

Resolved by: https://github.com/apache/spark/pull/16842

> Kinesis checkpoint recovery is 10x slow
> ---
>
> Key: SPARK-19304
> URL: https://issues.apache.org/jira/browse/SPARK-19304
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: using s3 for checkpoints using 1 executor, with 19g mem 
> & 3 cores per executor
>Reporter: Gaurav Shah
>Assignee: Gaurav Shah
>  Labels: kinesis
> Fix For: 2.2.0
>
>
> The application runs fine initially, running batches of 1 hour, and the processing 
> time is less than 30 minutes on average. For some reason, let's say the 
> application crashes, and we try to restart from checkpoint. The processing 
> now takes forever and does not move forward. We tried to test out the same 
> thing at a batch interval of 1 minute; the processing runs fine and takes 1.2 
> minutes for a batch to finish. When we recover from the checkpoint it takes about 
> 15 minutes for each batch. After the recovery the batches again process at 
> normal speed.
> I suspect the KinesisBackedBlockRDD used for recovery is causing the slowdown.
> Stackoverflow post with more details: 
> http://stackoverflow.com/questions/38390567/spark-streaming-checkpoint-recovery-is-very-very-slow



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19304) Kinesis checkpoint recovery is 10x slow

2017-03-06 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz reassigned SPARK-19304:
---

Assignee: Gaurav Shah

> Kinesis checkpoint recovery is 10x slow
> ---
>
> Key: SPARK-19304
> URL: https://issues.apache.org/jira/browse/SPARK-19304
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: using s3 for checkpoints using 1 executor, with 19g mem 
> & 3 cores per executor
>Reporter: Gaurav Shah
>Assignee: Gaurav Shah
>  Labels: kinesis
>
> The application runs fine initially, running batches of 1 hour, and the processing 
> time is less than 30 minutes on average. For some reason, let's say the 
> application crashes, and we try to restart from checkpoint. The processing 
> now takes forever and does not move forward. We tried to test out the same 
> thing at a batch interval of 1 minute; the processing runs fine and takes 1.2 
> minutes for a batch to finish. When we recover from the checkpoint it takes about 
> 15 minutes for each batch. After the recovery the batches again process at 
> normal speed.
> I suspect the KinesisBackedBlockRDD used for recovery is causing the slowdown.
> Stackoverflow post with more details: 
> http://stackoverflow.com/questions/38390567/spark-streaming-checkpoint-recovery-is-very-very-slow



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19595) from_json produces only a single row when input is a json array

2017-03-05 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-19595.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Resolved by https://github.com/apache/spark/pull/16929

> from_json produces only a single row when input is a json array
> ---
>
> Key: SPARK-19595
> URL: https://issues.apache.org/jira/browse/SPARK-19595
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.2.0
>
>
> Currently, {{from_json}} reads a single row when it is a json array. For 
> example,
> {code}
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.types._
> val schema = StructType(StructField("a", IntegerType) :: Nil)
> Seq(("""[{"a": 1}, {"a": 
> 2}]""")).toDF("struct").select(from_json(col("struct"), schema)).show()
> +--------------------+
> |jsontostruct(struct)|
> +--------------------+
> |                 [1]|
> +--------------------+
> {code}
> Maybe we should not support this in that function or it should work like a 
> generator expression.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19595) from_json produces only a single row when input is a json array

2017-03-05 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz reassigned SPARK-19595:
---

Assignee: Hyukjin Kwon

> from_json produces only a single row when input is a json array
> ---
>
> Key: SPARK-19595
> URL: https://issues.apache.org/jira/browse/SPARK-19595
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.2.0
>
>
> Currently, {{from_json}} reads a single row when it is a json array. For 
> example,
> {code}
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.types._
> val schema = StructType(StructField("a", IntegerType) :: Nil)
> Seq(("""[{"a": 1}, {"a": 
> 2}]""")).toDF("struct").select(from_json(col("struct"), schema)).show()
> +--------------------+
> |jsontostruct(struct)|
> +--------------------+
> |                 [1]|
> +--------------------+
> {code}
> Maybe we should not support this in that function or it should work like a 
> generator expression.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19813) maxFilesPerTrigger combo latestFirst may miss old files in combination with maxFileAge in FileStreamSource

2017-03-03 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-19813:
---

 Summary: maxFilesPerTrigger combo latestFirst may miss old files 
in combination with maxFileAge in FileStreamSource
 Key: SPARK-19813
 URL: https://issues.apache.org/jira/browse/SPARK-19813
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.1.0
Reporter: Burak Yavuz
Assignee: Burak Yavuz


There is a file stream source option called maxFileAge which limits how old the 
files can be, relative to the latest file that has been seen. This is used to 
limit the files that need to be remembered as "processed". Files older than the 
latest processed files are ignored. This value defaults to 7 days.
This causes a problem when both 
 - latestFirst = true
 - maxFilesPerTrigger > total files to be processed.

Here is what happens in all combinations:
 1) latestFirst = false - Since files are processed in order, there won't be any 
unprocessed file older than the latest processed file. All files will be 
processed.
 2) latestFirst = true AND maxFilesPerTrigger is not set - The maxFileAge 
thresholding mechanism takes one batch to initialize. If maxFilesPerTrigger is 
not set, then all old files get processed in the first batch, and so no file is 
left behind.
 3) latestFirst = true AND maxFilesPerTrigger is set to X - The first batch 
processes the latest X files. That sets the threshold to latest file - maxFileAge, 
so files older than this threshold will never be considered for processing. 

The bug is with case 3.
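
For reference, a minimal sketch of the option combination in case 3 on a file
stream source; the input directory is hypothetical, and "7d" mirrors the
default maxFileAge described above:

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("file-source-options").master("local[2]").getOrCreate()

val lines = spark.readStream
  .option("latestFirst", "true")        // process newest files first
  .option("maxFilesPerTrigger", "100")  // the "X" files per batch from case 3
  .option("maxFileAge", "7d")           // files older than (latest seen - 7 days) are ignored
  .text("/path/to/input")               // hypothetical directory
{code}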



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[phpMyAdmin Git] [phpmyadmin/phpmyadmin] b17a0f: Translated using Weblate (Turkish)

2017-03-01 Thread Burak Yavuz
  Branch: refs/heads/master
  Home:   https://github.com/phpmyadmin/phpmyadmin
  Commit: b17a0f91cc23bf8800c4aa715d5a7c9a050de41a
  
https://github.com/phpmyadmin/phpmyadmin/commit/b17a0f91cc23bf8800c4aa715d5a7c9a050de41a
  Author: Burak Yavuz <hitowerdi...@hotmail.com>
  Date:   2017-03-01 (Wed, 01 Mar 2017)

  Changed paths:
M po/tr.po

  Log Message:
  ---
  Translated using Weblate (Turkish)

Currently translated at 100.0% (3194 of 3194 strings)

[CI skip]


___
Git mailing list
Git@phpmyadmin.net
https://lists.phpmyadmin.net/mailman/listinfo/git


[jira] [Created] (SPARK-19774) StreamExecution should call stop() on sources when a stream fails

2017-02-28 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-19774:
---

 Summary: StreamExecution should call stop() on sources when a 
stream fails
 Key: SPARK-19774
 URL: https://issues.apache.org/jira/browse/SPARK-19774
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.1.0, 2.0.2
Reporter: Burak Yavuz
Assignee: Burak Yavuz


We call stop() on a Structured Streaming Source only when the stream is 
shut down by a user calling streamingQuery.stop(). We should actually stop all 
sources when the stream fails as well; otherwise we may leak resources, e.g. 
connections to Kafka.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[phpMyAdmin Git] [phpmyadmin/localized_docs] bb684c: Translated using Weblate (Turkish)

2017-02-23 Thread Burak Yavuz
  Branch: refs/heads/master
  Home:   https://github.com/phpmyadmin/localized_docs
  Commit: bb684c9e7108bb04ac7421b6126f1f0b05209c94
  
https://github.com/phpmyadmin/localized_docs/commit/bb684c9e7108bb04ac7421b6126f1f0b05209c94
  Author: Burak Yavuz <hitowerdi...@hotmail.com>
  Date:   2017-02-23 (Thu, 23 Feb 2017)

  Changed paths:
M po/tr.mo
M po/tr.po

  Log Message:
  ---
  Translated using Weblate (Turkish)

Currently translated at 100.0% (2355 of 2355 strings)

[CI skip]


___
Git mailing list
Git@phpmyadmin.net
https://lists.phpmyadmin.net/mailman/listinfo/git


[jira] [Resolved] (SPARK-19405) Add support to KinesisUtils for cross-account Kinesis reads via STS

2017-02-22 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-19405.
-
   Resolution: Fixed
 Assignee: Adam Budde
Fix Version/s: 2.2.0

Resolved with: https://github.com/apache/spark/pull/16744

> Add support to KinesisUtils for cross-account Kinesis reads via STS
> ---
>
> Key: SPARK-19405
> URL: https://issues.apache.org/jira/browse/SPARK-19405
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Reporter: Adam Budde
>Assignee: Adam Budde
>Priority: Minor
> Fix For: 2.2.0
>
>
> h1. Summary
> Enable KinesisReceiver to utilize STSAssumeRoleSessionCredentialsProvider 
> when setting up the Kinesis Client Library in order to enable secure 
> cross-account Kinesis stream reads managed by AWS Simple Token Service (STS)
> h1. Details
> Spark's KinesisReceiver implementation utilizes the Kinesis Client Library in 
> order to allow users to write Spark Streaming jobs that operate on Kinesis 
> data. The KCL uses a few AWS services under the hood in order to provide 
> checkpointed, load-balanced processing of the underlying data in a Kinesis 
> stream.  Running the KCL requires permissions to be set up for the following 
> AWS resources.
> * AWS Kinesis for reading stream data
> * AWS DynamoDB for storing KCL shared state in tables
> * AWS CloudWatch for logging KCL metrics
> The KinesisUtils.createStream() API allows users to authenticate to these 
> services either by specifying an explicit AWS access key/secret key 
> credential pair or by using the default credential provider chain. This 
> supports authorizing to the three AWS services using either an AWS keypair 
> (either provided explicitly or parsed from environment variables, etc.):
> !https://raw.githubusercontent.com/budde/budde_asf_jira_images/master/spark/kinesis_sts_support/KeypairOnly.png!
> Or the IAM instance profile (when running on EC2):
> !https://raw.githubusercontent.com/budde/budde_asf_jira_images/master/spark/kinesis_sts_support/InstanceProfileOnly.png!
> AWS users often need to access resources across separate accounts. This could 
> be done in order to consume data produced by another organization or from a 
> service running in another account for resource isolation purposes. AWS 
> Simple Token Service (STS) provides a secure way to authorize cross-account 
> resource access by using temporary sessions to assume an IAM role in the 
> AWS account with the resources being accessed.
> The [IAM 
> documentation|http://docs.aws.amazon.com/IAM/latest/UserGuide/tutorial_cross-account-with-roles.html]
>  covers the specifics of how cross account IAM role assumption works in much 
> greater detail, but if an actor in account A wanted to read from a Kinesis 
> stream in account B the general steps required would look something like this:
> * An IAM role is added to account B with read permissions for the Kinesis 
> stream
> ** Trust policy is configured to allow account A to assume the role 
> * Actor in account A uses its own long-lived credentials to tell STS to 
> assume the role in account B
> * STS returns temporary credentials with permission to read from the stream 
> in account B
> Applied to KinesisReceiver and the KCL, we could use a keypair as our 
> long-lived credentials to authenticate to STS and assume an external role 
> with the necessary KCL permissions:
> !https://raw.githubusercontent.com/budde/budde_asf_jira_images/master/spark/kinesis_sts_support/STSKeypair.png!
> Or the instance profile as long-lived credentials:
> !https://raw.githubusercontent.com/budde/budde_asf_jira_images/master/spark/kinesis_sts_support/STSInstanceProfile.png!
> The STSAssumeRoleSessionCredentialsProvider implementation of the 
> AWSCredentialsProviderChain interface from the AWS SDK abstracts all of the 
> management of the temporary session credentials away from the user. 
> STSAssumeRoleSessionCredentialsProvider simply needs the ARN of the AWS role 
> to be assumed, a session name for STS labeling purposes, an optional session 
> external ID and long-lived credentials to use for authenticating with the STS 
> service itself.
> Supporting cross-account Kinesis access via STS requires supplying the 
> following additional configuration parameters:
> * ARN of IAM role to assume in external account
> * A name to apply to the STS session
> * (optional) An IAM external ID to validate the assumed role against
> The STSAssumeRoleSessionCredentialsProvider implementation of the 
> AWSCredentialsProvider int

[jira] [Created] (SPARK-19637) add to_json APIs to SQL

2017-02-16 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-19637:
---

 Summary: add to_json APIs to SQL
 Key: SPARK-19637
 URL: https://issues.apache.org/jira/browse/SPARK-19637
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.1.0
Reporter: Burak Yavuz


The method "to_json" is a useful method in turning a struct into a json string. 
It currently doesn't work in SQL, but adding it should be trivial.
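
For illustration, a sketch of the difference (table and column names are made up; the DataFrame API call exists today, while the SQL form is what this issue proposes):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{struct, to_json}

object ToJsonExample extends App {
  val spark = SparkSession.builder().master("local[*]").appName("to_json").getOrCreate()
  import spark.implicits._

  // Already available through the DataFrame API in Spark 2.1:
  val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
  df.select(to_json(struct($"id", $"name")).as("json")).show(false)

  // The SQL-level equivalent this issue asks for, roughly:
  //   SELECT to_json(struct(id, name)) AS json FROM my_table
}
{code}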



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: welcoming Takuya Ueshin as a new Apache Spark committer

2017-02-13 Thread Burak Yavuz
Congrats Takuya!

On Mon, Feb 13, 2017 at 2:17 PM, Dilip Biswal  wrote:

> Congratulations, Takuya!
>
> Regards,
> Dilip Biswal
> Tel: 408-463-4980
> dbis...@us.ibm.com
>
>
>
> - Original message -
> From: Takeshi Yamamuro 
> To: dev 
> Cc:
> Subject: Re: welcoming Takuya Ueshin as a new Apache Spark committer
> Date: Mon, Feb 13, 2017 2:14 PM
>
> congrats!
>
>
> On Tue, Feb 14, 2017 at 6:05 AM, Sam Elamin 
> wrote:
>
> Congrats Takuya-san! Clearly well deserved! Well done :)
>
> On Mon, Feb 13, 2017 at 9:02 PM, Maciej Szymkiewicz <
> mszymkiew...@gmail.com> wrote:
>
> Congratulations!
>
>
> On 02/13/2017 08:16 PM, Reynold Xin wrote:
> > Hi all,
> >
> > Takuya-san has recently been elected an Apache Spark committer. He's
> > been active in the SQL area and writes very small, surgical patches
> > that are high quality. Please join me in congratulating Takuya-san!
> >
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
>
>
> --
> ---
> Takeshi Yamamuro
>
>
>
> - To
> unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>


[jira] [Resolved] (SPARK-19542) Delete the temp checkpoint if a query is stopped without errors

2017-02-13 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-19542.
-
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

> Delete the temp checkpoint if a query is stopped without errors
> ---
>
> Key: SPARK-19542
> URL: https://issues.apache.org/jira/browse/SPARK-19542
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 2.1.1, 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19543) from_json fails when the input row is empty

2017-02-09 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-19543:
---

 Summary: from_json fails when the input row is empty 
 Key: SPARK-19543
 URL: https://issues.apache.org/jira/browse/SPARK-19543
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Burak Yavuz


Using from_json on a column with an empty string results in: 
java.util.NoSuchElementException: head of empty list
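
A minimal reproduction sketch along the lines of the report (the schema and sample rows below are assumptions, not taken from the ticket):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object FromJsonEmptyRow extends App {
  val spark = SparkSession.builder().master("local[*]").appName("from_json-empty").getOrCreate()
  import spark.implicits._

  val schema = StructType(Seq(StructField("a", StringType)))
  val df = Seq("""{"a": "x"}""", "").toDF("value")   // second row is an empty string
  df.select(from_json($"value", schema)).show()      // fails with NoSuchElementException per this report
}
{code}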




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [Structured Streaming] Using File Sink to store to hive table.

2017-02-06 Thread Burak Yavuz
I presume you may be able to implement a custom sink and use
df.saveAsTable. The problem is that you will have to handle idempotence /
garbage collection yourself, in case your job fails while writing, etc.

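Just to make the idea concrete, here is a rough sketch using the (internal,
subject to change) Sink APIs in Spark 2.x -- the class names and table handling
below are my own assumptions, and batch-level idempotence is deliberately left
unsolved:

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.StreamSinkProvider
import org.apache.spark.sql.streaming.OutputMode

// Appends each micro-batch to a Hive table. If a batch is retried after a
// failure, rows may be written twice; a real sink would track committed batchIds.
class HiveAppendSink(table: String) extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // The DataFrame handed to addBatch is tied to the streaming plan; converting
    // it to a plain batch DataFrame first is a commonly used workaround.
    val batch = data.sparkSession.createDataFrame(data.rdd, data.schema)
    batch.write.mode("append").saveAsTable(table)
  }
}

class HiveAppendSinkProvider extends StreamSinkProvider {
  override def createSink(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      partitionColumns: Seq[String],
      outputMode: OutputMode): Sink =
    new HiveAppendSink(parameters.getOrElse("table", "default.events"))
}

// Usage: .writeStream.format("fully.qualified.HiveAppendSinkProvider")
//        .option("table", "default.events").start()
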
On Mon, Feb 6, 2017 at 5:53 PM, Egor Pahomov <pahomov.e...@gmail.com> wrote:

> I have a stream of files on HDFS with JSON events. I need to convert it to
> pq in real time, process some fields and store it in a simple Hive table so
> people can query it. People might even want to query it with Impala, so
> it's important that it be a real Hive-metastore-based table. How can I
> do that?
>
> 2017-02-06 14:25 GMT-08:00 Burak Yavuz <brk...@gmail.com>:
>
>> Hi Egor,
>>
>> Structured Streaming handles all of its metadata itself, which files are
>> actually valid, etc. You may use the "create table" syntax in SQL to treat
>> it like a hive table, but it will handle all partitioning information in
>> its own metadata log. Is there a specific reason that you want to store the
>> information in the Hive Metastore?
>>
>> Best,
>> Burak
>>
>> On Mon, Feb 6, 2017 at 11:39 AM, Egor Pahomov <pahomov.e...@gmail.com>
>> wrote:
>>
>>> Hi, I'm thinking of using Structured Streaming instead of old streaming,
>>> but I need to be able to save results to Hive table. Documentation for file
>>> sink says(http://spark.apache.org/docs/latest/structured-streamin
>>> g-programming-guide.html#output-sinks): "Supports writes to partitioned
>>> tables. ". But being able to write to partitioned directories is not
>>> enough to write to the table: someone needs to write to Hive metastore. How
>>> can I use Structured Streaming and write to Hive table?
>>>
>>> --
>>>
>>>
>>> *Sincerely yoursEgor Pakhomov*
>>>
>>
>>
>
>
> --
>
>
> *Sincerely yoursEgor Pakhomov*
>


Re: [Structured Streaming] Using File Sink to store to hive table.

2017-02-06 Thread Burak Yavuz
Hi Egor,

Structured Streaming handles all of its metadata itself, which files are
actually valid, etc. You may use the "create table" syntax in SQL to treat
it like a hive table, but it will handle all partitioning information in
its own metadata log. Is there a specific reason that you want to store the
information in the Hive Metastore?
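
To make the "create table" route concrete, a hedged sketch (the path, table name
and columns are assumptions; also note that only Spark's own readers understand
the _spark_metadata log the file sink writes, so other engines may see
in-progress files):

// assuming an active SparkSession named `spark`, e.g. in spark-shell,
// and a streaming query that already writes Parquet to /warehouse/events_pq
spark.sql(
  """CREATE TABLE events_pq (id BIGINT, name STRING)
    |USING parquet
    |OPTIONS (path '/warehouse/events_pq')""".stripMargin)
spark.sql("SELECT count(*) FROM events_pq").show()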

Best,
Burak

On Mon, Feb 6, 2017 at 11:39 AM, Egor Pahomov 
wrote:

> Hi, I'm thinking of using Structured Streaming instead of old streaming,
> but I need to be able to save results to Hive table. Documentation for file
> sink says(http://spark.apache.org/docs/latest/structured-
> streaming-programming-guide.html#output-sinks): "Supports writes to
> partitioned tables. ". But being able to write to partitioned directories
> is not enough to write to the table: someone needs to write to Hive
> metastore. How can I use Structured Streaming and write to Hive table?
>
> --
>
>
> *Sincerely yoursEgor Pakhomov*
>


[phpMyAdmin Git] [phpmyadmin/phpmyadmin] 98ad19: Translated using Weblate (Turkish)

2017-02-01 Thread Burak Yavuz
  Branch: refs/heads/QA_4_7
  Home:   https://github.com/phpmyadmin/phpmyadmin
  Commit: 98ad19576a8fb0fa588a6b060b88a7ab514b23e9
  
https://github.com/phpmyadmin/phpmyadmin/commit/98ad19576a8fb0fa588a6b060b88a7ab514b23e9
  Author: Burak Yavuz <hitowerdi...@hotmail.com>
  Date:   2017-02-01 (Wed, 01 Feb 2017)

  Changed paths:
M po/tr.po

  Log Message:
  ---
  Translated using Weblate (Turkish)

Currently translated at 100.0% (3199 of 3199 strings)

[CI skip]


___
Git mailing list
Git@phpmyadmin.net
https://lists.phpmyadmin.net/mailman/listinfo/git


Re: eager? in dataframe's checkpoint

2017-01-31 Thread Burak Yavuz
Hi Koert,

When eager is true, we return you a new DataFrame that depends on the files
written out to the checkpoint directory.
All previous operations on the checkpointed DataFrame are gone forever. You
basically start fresh. AFAIK, when eager is true, the method will not
return until the DataFrame is completely checkpointed. If you look at the
RDD.checkpoint implementation, the checkpoint location is updated
synchronously, so during the count `isCheckpointed` will be true.
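
A tiny illustration (the local paths and sizes are just assumptions):

import org.apache.spark.sql.SparkSession

object EagerCheckpointExample extends App {
  val spark = SparkSession.builder().master("local[*]").appName("eager-checkpoint").getOrCreate()
  spark.sparkContext.setCheckpointDir("/tmp/checkpoints")  // assumed path

  val df = spark.range(0, 1000000).toDF("id").filter("id % 7 = 0")
  val cp = df.checkpoint(eager = true)  // blocks until the data is written out
  println(cp.count())                   // reads the checkpointed files; lineage is cut
}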

Best,
Burak

On Tue, Jan 31, 2017 at 12:52 PM, Koert Kuipers <ko...@tresata.com> wrote:

> i understand that checkpoint cuts the lineage, but i am not fully sure i
> understand the role of eager.
>
> eager simply seems to materialize the rdd early with a count, right after
> the rdd has been checkpointed. but why is that useful? rdd.checkpoint is
> asynchronous, so when the rdd.count happens most likely rdd.isCheckpointed
> will be false, and the count will be on the rdd before it was checkpointed.
> what is the benefit of that?
>
>
> On Thu, Jan 26, 2017 at 11:19 PM, Burak Yavuz <brk...@gmail.com> wrote:
>
>> Hi,
>>
>> One of the goals of checkpointing is to cut the RDD lineage. Otherwise
>> you run into StackOverflowExceptions. If you eagerly checkpoint, you
>> basically cut the lineage there, and the next operations all depend on the
>> checkpointed DataFrame. If you don't checkpoint, you continue to build the
>> lineage, therefore while that lineage is being resolved, you may hit the
>> StackOverflowException.
>>
>> HTH,
>> Burak
>>
>> On Thu, Jan 26, 2017 at 10:36 AM, Jean Georges Perrin <j...@jgp.net>
>> wrote:
>>
>>> Hey Sparkers,
>>>
>>> Trying to understand the Dataframe's checkpoint (*not* in the context
>>> of streaming) https://spark.apache.org/docs/latest/api/java/org
>>> /apache/spark/sql/Dataset.html#checkpoint(boolean)
>>>
>>> What is the goal of the *eager* flag?
>>>
>>> Thanks!
>>>
>>> jg
>>>
>>
>>
>


Re: eager? in dataframe's checkpoint

2017-01-26 Thread Burak Yavuz
Hi,

One of the goals of checkpointing is to cut the RDD lineage. Otherwise you
run into StackOverflowExceptions. If you eagerly checkpoint, you basically
cut the lineage there, and the next operations all depend on the
checkpointed DataFrame. If you don't checkpoint, you continue to build the
lineage, therefore while that lineage is being resolved, you may hit the
StackOverflowException.

HTH,
Burak

On Thu, Jan 26, 2017 at 10:36 AM, Jean Georges Perrin  wrote:

> Hey Sparkers,
>
> Trying to understand the Dataframe's checkpoint (*not* in the context of
> streaming) https://spark.apache.org/docs/latest/api/
> java/org/apache/spark/sql/Dataset.html#checkpoint(boolean)
>
> What is the goal of the *eager* flag?
>
> Thanks!
>
> jg
>


[jira] [Updated] (SPARK-18218) Optimize BlockMatrix multiplication, which may cause OOM and low parallelism usage problem in several cases

2017-01-26 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz updated SPARK-18218:

Assignee: Weichen Xu

> Optimize BlockMatrix multiplication, which may cause OOM and low parallelism 
> usage problem in several cases
> ---
>
> Key: SPARK-18218
> URL: https://issues.apache.org/jira/browse/SPARK-18218
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
> Fix For: 2.2.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> After taking a deep look into the `BlockMatrix.multiply` implementation, I found 
> that the current implementation may cause problems in special cases.
> Now let me use an extreme case to illustrate it:
> Suppose we have two blockMatrix A and B
> A has 1 blocks, numRowBlocks = 1,  numColBlocks = 1
> B also has 1 blocks, numRowBlocks = 1,  numColBlocks = 1
> Now if we call A.multiply(B), no matter how A and B are partitioned,
> the resultPartitioner will always contain only one partition,
> so this multiplication implementation will shuffle 1 * 1 blocks into one 
> reducer, which reduces the parallelism to 1. 
> What's worse, because `RDD.cogroup` loads the whole group into 
> memory, at the reducer side 1 * 1 blocks will be loaded into memory, 
> because they are all shuffled into the same group. This can easily cause 
> executor OOM.
> The above case is a little extreme, but in other cases, such as an M*N 
> matrix A multiplied by an N*P matrix B where N is much larger than M and 
> P, we hit a similar problem.
> The multiplication implementation does not handle task partitioning properly, 
> which causes:
> 1. reducer OOM when the middle dimension N is too large.
> 2. parallelism that is too low, even when OOM does not occur.
> 3. too many partitions on the M and P dimensions when N is much larger than 
> M and P and matrices A and B have many partitions, which leads to a much 
> larger shuffled data size.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18218) Optimize BlockMatrix multiplication, which may cause OOM and low parallelism usage problem in several cases

2017-01-26 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-18218.
-
   Resolution: Implemented
Fix Version/s: 2.2.0

Resolved by https://github.com/apache/spark/pull/15730

> Optimize BlockMatrix multiplication, which may cause OOM and low parallelism 
> usage problem in several cases
> ---
>
> Key: SPARK-18218
> URL: https://issues.apache.org/jira/browse/SPARK-18218
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
> Fix For: 2.2.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> After taking a deep look into the `BlockMatrix.multiply` implementation, I found 
> that the current implementation may cause problems in special cases.
> Now let me use an extreme case to illustrate it:
> Suppose we have two blockMatrix A and B
> A has 1 blocks, numRowBlocks = 1,  numColBlocks = 1
> B also has 1 blocks, numRowBlocks = 1,  numColBlocks = 1
> Now if we call A.multiply(B), no matter how A and B are partitioned,
> the resultPartitioner will always contain only one partition,
> so this multiplication implementation will shuffle 1 * 1 blocks into one 
> reducer, which reduces the parallelism to 1. 
> What's worse, because `RDD.cogroup` loads the whole group into 
> memory, at the reducer side 1 * 1 blocks will be loaded into memory, 
> because they are all shuffled into the same group. This can easily cause 
> executor OOM.
> The above case is a little extreme, but in other cases, such as an M*N 
> matrix A multiplied by an N*P matrix B where N is much larger than M and 
> P, we hit a similar problem.
> The multiplication implementation does not handle task partitioning properly, 
> which causes:
> 1. reducer OOM when the middle dimension N is too large.
> 2. parallelism that is too low, even when OOM does not occur.
> 3. too many partitions on the M and P dimensions when N is much larger than 
> M and P and matrices A and B have many partitions, which leads to a much 
> larger shuffled data size.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19378) StateOperator metrics should still return the total number of rows in state even if there was no data for a trigger

2017-01-26 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz updated SPARK-19378:

Description: 
If you have a StreamingDataFrame with an aggregation, we report a metric called 
stateOperators which consists of a list of data points per aggregation for our 
query (With Spark 2.1, only one aggregation is supported).

These data points report:
 - numUpdatedStateRows
 - numTotalStateRows

If a trigger had no data - and therefore was not fired - we return 0 data points; 
however, we should actually return a data point with
 - numTotalStateRows: numTotalStateRows in lastExecution
 - numUpdatedStateRows: 0

This also affects eventTime statistics. We should still provide the min, max, 
avg even though the data didn't change.

  was:
If you have a StreamingDataFrame with an aggregation, we report a metric called 
stateOperators which consists of a list of data points per aggregation for our 
query (With Spark 2.1, only one aggregation is supported).

These data points report:
 - numUpdatedStateRows
 - numTotalStateRows

If a trigger had no data - therefore was not fired - we return 0 data points, 
however we should actually return a data point with
 - numTotalStateRows: numTotalStateRows in lastExecution
 - numUpdatedStateRows: 0



> StateOperator metrics should still return the total number of rows in state 
> even if there was no data for a trigger
> ---
>
> Key: SPARK-19378
> URL: https://issues.apache.org/jira/browse/SPARK-19378
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>
> If you have a StreamingDataFrame with an aggregation, we report a metric 
> called stateOperators which consists of a list of data points per aggregation 
> for our query (With Spark 2.1, only one aggregation is supported).
> These data points report:
>  - numUpdatedStateRows
>  - numTotalStateRows
> If a trigger had no data - and therefore was not fired - we return 0 data points; 
> however, we should actually return a data point with
>  - numTotalStateRows: numTotalStateRows in lastExecution
>  - numUpdatedStateRows: 0
> This also affects eventTime statistics. We should still provide the min, max, 
> avg even though the data didn't change.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19378) StateOperator metrics should still return the total number of rows in state even if there was no data for a trigger

2017-01-26 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-19378:
---

 Summary: StateOperator metrics should still return the total 
number of rows in state even if there was no data for a trigger
 Key: SPARK-19378
 URL: https://issues.apache.org/jira/browse/SPARK-19378
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.1.0
Reporter: Burak Yavuz
Assignee: Burak Yavuz


If you have a StreamingDataFrame with an aggregation, we report a metric called 
stateOperators which consists of a list of data points per aggregation for our 
query (With Spark 2.1, only one aggregation is supported).

These data points report:
 - numUpdatedStateRows
 - numTotalStateRows

If a trigger had no data - and therefore was not fired - we return 0 data points; 
however, we should actually return a data point with
 - numTotalStateRows: numTotalStateRows in lastExecution
 - numUpdatedStateRows: 0




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Java heap error during matrix multiplication

2017-01-26 Thread Burak Yavuz
Hi,

Have you tried creating more column blocks?

BlockMatrix matrix = cmatrix.toBlockMatrix(100, 100);

for example.


Is your data randomly spread out, or do you generally have clusters of
data points together?


On Wed, Jan 25, 2017 at 4:23 AM, Petr Shestov  wrote:

> Hi all!
>
> I'm using Spark 2.0.1 with two workers (one executor each) with 20Gb each.
> And run following code:
>
> JavaRDD<MatrixEntry> entries = ...; // filling the data
> CoordinateMatrix cmatrix = new CoordinateMatrix(entries.rdd());
> BlockMatrix matrix = cmatrix.toBlockMatrix(100, 1000);
> BlockMatrix cooc = matrix.transpose().multiply(matrix);
>
> My matrix is approx 8 000 000 x 3000, but only 10 000 000 cells have
> meaningful value. During multiplication I always get:
>
> 17/01/24 08:03:10 WARN TaskMemoryManager: leak 1322.6 MB memory from 
> org.apache.spark.util.collection.ExternalAppendOnlyMap@649e7019
> 17/01/24 08:03:10 ERROR Executor: Exception in task 1.0 in stage 57.0 (TID 83664)
> java.lang.OutOfMemoryError: Java heap space
> at 
> org.apache.spark.mllib.linalg.DenseMatrix$.zeros(Matrices.scala:453)
> at 
> org.apache.spark.mllib.linalg.Matrix$class.multiply(Matrices.scala:101)
> at 
> org.apache.spark.mllib.linalg.SparseMatrix.multiply(Matrices.scala:565)
> at 
> org.apache.spark.mllib.linalg.distributed.BlockMatrix$$anonfun$23$$anonfun$apply$9$$anonfun$apply$11.apply(BlockMatrix.scala:483)
> at 
> org.apache.spark.mllib.linalg.distributed.BlockMatrix$$anonfun$23$$anonfun$apply$9$$anonfun$apply$11.apply(BlockMatrix.scala:480)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.immutable.List.map(List.scala:285)
> at 
> org.apache.spark.mllib.linalg.distributed.BlockMatrix$$anonfun$23$$anonfun$apply$9.apply(BlockMatrix.scala:480)
> at 
> org.apache.spark.mllib.linalg.distributed.BlockMatrix$$anonfun$23$$anonfun$apply$9.apply(BlockMatrix.scala:479)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at 
> org.apache.spark.util.collection.CompactBuffer$$anon$1.foreach(CompactBuffer.scala:115)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at 
> org.apache.spark.util.collection.CompactBuffer.foreach(CompactBuffer.scala:30)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
> at 
> org.apache.spark.util.collection.CompactBuffer.flatMap(CompactBuffer.scala:30)
> at 
> org.apache.spark.mllib.linalg.distributed.BlockMatrix$$anonfun$23.apply(BlockMatrix.scala:479)
> at 
> org.apache.spark.mllib.linalg.distributed.BlockMatrix$$anonfun$23.apply(BlockMatrix.scala:478)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
> at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>
> Now I'm even trying to use only one core per executor. What can be the
> problem? And how can I debug it and find root cause? What could I miss in
> spark configuration?
>
> I've already tried increasing spark.default.parallelism and decreasing
> blocks size for BlockMatrix.
>
> Thanks.
>


[jira] [Updated] (SPARK-18020) Kinesis receiver does not snapshot when shard completes

2017-01-25 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz updated SPARK-18020:

Assignee: Takeshi Yamamuro

> Kinesis receiver does not snapshot when shard completes
> ---
>
> Key: SPARK-18020
> URL: https://issues.apache.org/jira/browse/SPARK-18020
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: Yonathan Randolph
>Assignee: Takeshi Yamamuro
>Priority: Minor
>  Labels: kinesis
> Fix For: 2.2.0
>
>
> When a kinesis shard is split or combined and the old shard ends, the Amazon 
> Kinesis Client library [calls 
> IRecordProcessor.shutdown|https://github.com/awslabs/amazon-kinesis-client/blob/v1.7.0/src/main/java/com/amazonaws/services/kinesis/clientlibrary/lib/worker/ShutdownTask.java#L100]
>  and expects that {{IRecordProcessor.shutdown}} must checkpoint the sequence 
> number {{ExtendedSequenceNumber.SHARD_END}} before returning. Unfortunately, 
> spark’s 
> [KinesisRecordProcessor|https://github.com/apache/spark/blob/v2.0.1/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisRecordProcessor.scala]
>  sometimes does not checkpoint SHARD_END. This results in an error message, 
> and spark is then blocked indefinitely from processing any items from the 
> child shards.
> This issue has also been raised on StackOverflow: [resharding while spark 
> running on kinesis 
> stream|http://stackoverflow.com/questions/38898691/resharding-while-spark-running-on-kinesis-stream]
> Exception that is logged:
> {code}
> 16/10/19 19:37:49 ERROR worker.ShutdownTask: Application exception. 
> java.lang.IllegalArgumentException: Application didn't checkpoint at end of 
> shard shardId-0030
> at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownTask.call(ShutdownTask.java:106)
> at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:49)
> at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:24)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Command used to split shard:
> {code}
> aws kinesis --region us-west-1 split-shard --stream-name my-stream 
> --shard-to-split shardId-0030 --new-starting-hash-key 
> 5316911983139663491615228241121378303
> {code}
> After the spark-streaming job has hung, examining the DynamoDB table 
> indicates that the parent shard processor has not reached 
> {{ExtendedSequenceNumber.SHARD_END}} and the child shards are still at 
> {{ExtendedSequenceNumber.TRIM_HORIZON}} waiting for the parent to finish:
> {code}
> aws kinesis --region us-west-1 describe-stream --stream-name my-stream
> {
> "StreamDescription": {
> "RetentionPeriodHours": 24, 
> "StreamName": "my-stream", 
> "Shards": [
> {
> "ShardId": "shardId-0030", 
> "HashKeyRange": {
> "EndingHashKey": 
> "10633823966279326983230456482242756606", 
> "StartingHashKey": "0"
> },
> ...
> }, 
> {
> "ShardId": "shardId-0062", 
> "HashKeyRange": {
> "EndingHashKey": "5316911983139663491615228241121378302", 
> "StartingHashKey": "0"
> }, 
> "ParentShardId": "shardId-0030", 
> "SequenceNumberRange": {
> "StartingSequenceNumber": 
> "49566806087883755242230188435465744452396445937434624994"
> }
> }, 
> {
> "ShardId": "shardId-0063", 
> "HashKeyRange": {
> "EndingHashKey": 
> "10633823966279326983230456482242756606", 
>   

[jira] [Resolved] (SPARK-18020) Kinesis receiver does not snapshot when shard completes

2017-01-25 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-18020.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Resolved by https://github.com/apache/spark/pull/16213

> Kinesis receiver does not snapshot when shard completes
> ---
>
> Key: SPARK-18020
> URL: https://issues.apache.org/jira/browse/SPARK-18020
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: Yonathan Randolph
>Priority: Minor
>  Labels: kinesis
> Fix For: 2.2.0
>
>
> When a kinesis shard is split or combined and the old shard ends, the Amazon 
> Kinesis Client library [calls 
> IRecordProcessor.shutdown|https://github.com/awslabs/amazon-kinesis-client/blob/v1.7.0/src/main/java/com/amazonaws/services/kinesis/clientlibrary/lib/worker/ShutdownTask.java#L100]
>  and expects that {{IRecordProcessor.shutdown}} must checkpoint the sequence 
> number {{ExtendedSequenceNumber.SHARD_END}} before returning. Unfortunately, 
> spark’s 
> [KinesisRecordProcessor|https://github.com/apache/spark/blob/v2.0.1/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisRecordProcessor.scala]
>  sometimes does not checkpoint SHARD_END. This results in an error message, 
> and spark is then blocked indefinitely from processing any items from the 
> child shards.
> This issue has also been raised on StackOverflow: [resharding while spark 
> running on kinesis 
> stream|http://stackoverflow.com/questions/38898691/resharding-while-spark-running-on-kinesis-stream]
> Exception that is logged:
> {code}
> 16/10/19 19:37:49 ERROR worker.ShutdownTask: Application exception. 
> java.lang.IllegalArgumentException: Application didn't checkpoint at end of 
> shard shardId-0030
> at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownTask.call(ShutdownTask.java:106)
> at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:49)
> at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:24)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Command used to split shard:
> {code}
> aws kinesis --region us-west-1 split-shard --stream-name my-stream 
> --shard-to-split shardId-0030 --new-starting-hash-key 
> 5316911983139663491615228241121378303
> {code}
> After the spark-streaming job has hung, examining the DynamoDB table 
> indicates that the parent shard processor has not reached 
> {{ExtendedSequenceNumber.SHARD_END}} and the child shards are still at 
> {{ExtendedSequenceNumber.TRIM_HORIZON}} waiting for the parent to finish:
> {code}
> aws kinesis --region us-west-1 describe-stream --stream-name my-stream
> {
> "StreamDescription": {
> "RetentionPeriodHours": 24, 
> "StreamName": "my-stream", 
> "Shards": [
> {
> "ShardId": "shardId-0030", 
> "HashKeyRange": {
> "EndingHashKey": 
> "10633823966279326983230456482242756606", 
> "StartingHashKey": "0"
> },
> ...
> }, 
> {
> "ShardId": "shardId-0062", 
> "HashKeyRange": {
> "EndingHashKey": "5316911983139663491615228241121378302", 
> "StartingHashKey": "0"
> }, 
> "ParentShardId": "shardId-0030", 
> "SequenceNumberRange": {
> "StartingSequenceNumber": 
> "49566806087883755242230188435465744452396445937434624994"
> }
> }, 
> {
> "ShardId": "shardId-0063", 
> "HashKeyRange": {
> "EndingHashKey": 
> "10

Re: How to make the state in a streaming application idempotent?

2017-01-25 Thread Burak Yavuz
Yes, you may. It depends on whether you want exact values or whether you're okay
with approximations. With big data, you would generally be okay with
approximations. Try both out and see what scales/works with your dataset.
Maybe you can handle the second implementation.

On Wed, Jan 25, 2017 at 12:23 PM, shyla deshpande <deshpandesh...@gmail.com>
wrote:

> Thanks Burak. But with BloomFilter, won't I be getting a false positive?
>
> On Wed, Jan 25, 2017 at 11:28 AM, Burak Yavuz <brk...@gmail.com> wrote:
>
>> I noticed that 1 wouldn't be a problem, because you'll save the
>> BloomFilter in the state.
>>
>> For 2, you would keep a Map of UUID's to the timestamp of when you saw
>> them. If the UUID exists in the map, then you wouldn't increase the count.
>> If the timestamp of a UUID expires, you would remove it from the map. The
>> reason we remove from the map is to keep a bounded amount of space. It'll
>> probably take a lot more space than the BloomFilter though depending on
>> your data volume.
>>
>> On Wed, Jan 25, 2017 at 11:24 AM, shyla deshpande <
>> deshpandesh...@gmail.com> wrote:
>>
>>> In the previous email you gave me 2 solutions
>>> 1. Bloom filter --> problem in repopulating the bloom filter on restarts
>>> 2. keeping the state of the unique ids
>>>
>>> Please elaborate on 2.
>>>
>>>
>>>
>>> On Wed, Jan 25, 2017 at 10:53 AM, Burak Yavuz <brk...@gmail.com> wrote:
>>>
>>>> I don't have any sample code, but on a high level:
>>>>
>>>> My state would be: (Long, BloomFilter[UUID])
>>>> In the update function, my value will be the UUID of the record, since
>>>> the word itself is the key.
>>>> I'll ask my BloomFilter if I've seen this UUID before. If not increase
>>>> count, also add to Filter.
>>>>
>>>> Does that make sense?
>>>>
>>>>
>>>> On Wed, Jan 25, 2017 at 9:28 AM, shyla deshpande <
>>>> deshpandesh...@gmail.com> wrote:
>>>>
>>>>> Hi Burak,
>>>>> Thanks for the response. Can you please elaborate on your idea of
>>>>> storing the state of the unique ids.
>>>>> Do you have any sample code or links I can refer to.
>>>>> Thanks
>>>>>
>>>>> On Wed, Jan 25, 2017 at 9:13 AM, Burak Yavuz <brk...@gmail.com> wrote:
>>>>>
>>>>>> Off the top of my head... (Each may have its own issues)
>>>>>>
>>>>>> If upstream you add a uniqueId to all your records, then you may use
>>>>>> a BloomFilter to approximate if you've seen a row before.
>>>>>> The problem I can see with that approach is how to repopulate the
>>>>>> bloom filter on restarts.
>>>>>>
>>>>>> If you are certain that you're not going to reprocess some data after
>>>>>> a certain time, i.e. there is no way I'm going to get the same data in 2
>>>>>> hours, it may only happen in the last 2 hours, then you may also keep the
>>>>>> state of uniqueId's as well, and then age them out after a certain time.
>>>>>>
>>>>>>
>>>>>> Best,
>>>>>> Burak
>>>>>>
>>>>>> On Tue, Jan 24, 2017 at 9:53 PM, shyla deshpande <
>>>>>> deshpandesh...@gmail.com> wrote:
>>>>>>
>>>>>>> Please share your thoughts.
>>>>>>>
>>>>>>> On Tue, Jan 24, 2017 at 4:01 PM, shyla deshpande <
>>>>>>> deshpandesh...@gmail.com> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jan 24, 2017 at 9:44 AM, shyla deshpande <
>>>>>>>> deshpandesh...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> My streaming application stores lot of aggregations using
>>>>>>>>> mapWithState.
>>>>>>>>>
>>>>>>>>> I want to know what are all the possible ways I can make it
>>>>>>>>> idempotent.
>>>>>>>>>
>>>>>>>>> Please share your views.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> On Mon, Jan 23, 2017 at 5:41 PM, shyla deshpande <
>>>>>>>>> deshpandesh...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> In a Wordcount application which  stores the count of all the
>>>>>>>>>> words input so far using mapWithState.  How do I make sure my counts 
>>>>>>>>>> are
>>>>>>>>>> not messed up if I happen to read a line more than once?
>>>>>>>>>>
>>>>>>>>>> Appreciate your response.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


Re: How to make the state in a streaming application idempotent?

2017-01-25 Thread Burak Yavuz
I noticed that 1 wouldn't be a problem, because you'll save the BloomFilter
in the state.

For 2, you would keep a Map of UUID's to the timestamp of when you saw
them. If the UUID exists in the map, then you wouldn't increase the count.
If the timestamp of a UUID expires, you would remove it from the map. The
reason we remove from the map is to keep a bounded amount of space. It'll
probably take a lot more space than the BloomFilter though depending on
your data volume.
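
A rough sketch of what 2 could look like with mapWithState (the 2-hour window,
the pair-DStream shape, and the names below are all assumptions):

import org.apache.spark.streaming.{State, StateSpec}

val maxAgeMs = 2 * 60 * 60 * 1000L  // "no duplicates after 2 hours" assumption

// key = word, value = the record's UUID, state = (count, Map[uuid -> lastSeenMs])
def dedupWithMap(word: String, uuid: Option[String],
                 state: State[(Long, Map[String, Long])]): (String, Long) = {
  val now = System.currentTimeMillis()
  val (count, seen) = if (state.exists()) state.get() else (0L, Map.empty[String, Long])
  val pruned = seen.filter { case (_, ts) => now - ts < maxAgeMs }  // age out old ids
  val id = uuid.getOrElse("")
  val (newCount, newSeen) =
    if (pruned.contains(id)) (count, pruned) else (count + 1, pruned + (id -> now))
  state.update((newCount, newSeen))
  (word, newCount)
}

// pairs is an assumed DStream[(String, String)] of (word, uuid):
// pairs.mapWithState(StateSpec.function(dedupWithMap _))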

On Wed, Jan 25, 2017 at 11:24 AM, shyla deshpande <deshpandesh...@gmail.com>
wrote:

> In the previous email you gave me 2 solutions
> 1. Bloom filter --> problem in repopulating the bloom filter on restarts
> 2. keeping the state of the unique ids
>
> Please elaborate on 2.
>
>
>
> On Wed, Jan 25, 2017 at 10:53 AM, Burak Yavuz <brk...@gmail.com> wrote:
>
>> I don't have any sample code, but on a high level:
>>
>> My state would be: (Long, BloomFilter[UUID])
>> In the update function, my value will be the UUID of the record, since
>> the word itself is the key.
>> I'll ask my BloomFilter if I've seen this UUID before. If not increase
>> count, also add to Filter.
>>
>> Does that make sense?
>>
>>
>> On Wed, Jan 25, 2017 at 9:28 AM, shyla deshpande <
>> deshpandesh...@gmail.com> wrote:
>>
>>> Hi Burak,
>>> Thanks for the response. Can you please elaborate on your idea of
>>> storing the state of the unique ids.
>>> Do you have any sample code or links I can refer to.
>>> Thanks
>>>
>>> On Wed, Jan 25, 2017 at 9:13 AM, Burak Yavuz <brk...@gmail.com> wrote:
>>>
>>>> Off the top of my head... (Each may have its own issues)
>>>>
>>>> If upstream you add a uniqueId to all your records, then you may use a
>>>> BloomFilter to approximate if you've seen a row before.
>>>> The problem I can see with that approach is how to repopulate the bloom
>>>> filter on restarts.
>>>>
>>>> If you are certain that you're not going to reprocess some data after a
>>>> certain time, i.e. there is no way I'm going to get the same data in 2
>>>> hours, it may only happen in the last 2 hours, then you may also keep the
>>>> state of uniqueId's as well, and then age them out after a certain time.
>>>>
>>>>
>>>> Best,
>>>> Burak
>>>>
>>>> On Tue, Jan 24, 2017 at 9:53 PM, shyla deshpande <
>>>> deshpandesh...@gmail.com> wrote:
>>>>
>>>>> Please share your thoughts.
>>>>>
>>>>> On Tue, Jan 24, 2017 at 4:01 PM, shyla deshpande <
>>>>> deshpandesh...@gmail.com> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Jan 24, 2017 at 9:44 AM, shyla deshpande <
>>>>>> deshpandesh...@gmail.com> wrote:
>>>>>>
>>>>>>> My streaming application stores lot of aggregations using
>>>>>>> mapWithState.
>>>>>>>
>>>>>>> I want to know what are all the possible ways I can make it
>>>>>>> idempotent.
>>>>>>>
>>>>>>> Please share your views.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> On Mon, Jan 23, 2017 at 5:41 PM, shyla deshpande <
>>>>>>> deshpandesh...@gmail.com> wrote:
>>>>>>>
>>>>>>>> In a Wordcount application which  stores the count of all the words
>>>>>>>> input so far using mapWithState.  How do I make sure my counts are not
>>>>>>>> messed up if I happen to read a line more than once?
>>>>>>>>
>>>>>>>> Appreciate your response.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


Re: How to make the state in a streaming application idempotent?

2017-01-25 Thread Burak Yavuz
I don't have any sample code, but on a high level:

My state would be: (Long, BloomFilter[UUID])
In the update function, my value will be the UUID of the record, since the
word itself is the key.
I'll ask my BloomFilter if I've seen this UUID before. If not increase
count, also add to Filter.
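
Something like this, roughly (the sizing, names, and pair-DStream shape are
assumptions):

import org.apache.spark.streaming.{State, StateSpec}
import org.apache.spark.util.sketch.BloomFilter

// key = word, value = the record's UUID, state = (count so far, filter of seen UUIDs)
def dedupCount(word: String, uuid: Option[String],
               state: State[(Long, BloomFilter)]): (String, Long) = {
  val (count, seen) =
    if (state.exists()) state.get()
    else (0L, BloomFilter.create(1000000L, 0.001))  // assumed capacity / false-positive rate
  val id = uuid.getOrElse("")
  val newCount =
    if (seen.mightContainString(id)) count
    else { seen.putString(id); count + 1 }
  state.update((newCount, seen))
  (word, newCount)
}

// pairs is an assumed DStream[(String, String)] of (word, uuid):
// pairs.mapWithState(StateSpec.function(dedupCount _))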

Does that make sense?


On Wed, Jan 25, 2017 at 9:28 AM, shyla deshpande <deshpandesh...@gmail.com>
wrote:

> Hi Burak,
> Thanks for the response. Can you please elaborate on your idea of storing
> the state of the unique ids.
> Do you have any sample code or links I can refer to.
> Thanks
>
> On Wed, Jan 25, 2017 at 9:13 AM, Burak Yavuz <brk...@gmail.com> wrote:
>
>> Off the top of my head... (Each may have its own issues)
>>
>> If upstream you add a uniqueId to all your records, then you may use a
>> BloomFilter to approximate if you've seen a row before.
>> The problem I can see with that approach is how to repopulate the bloom
>> filter on restarts.
>>
>> If you are certain that you're not going to reprocess some data after a
>> certain time, i.e. there is no way I'm going to get the same data in 2
>> hours, it may only happen in the last 2 hours, then you may also keep the
>> state of uniqueId's as well, and then age them out after a certain time.
>>
>>
>> Best,
>> Burak
>>
>> On Tue, Jan 24, 2017 at 9:53 PM, shyla deshpande <
>> deshpandesh...@gmail.com> wrote:
>>
>>> Please share your thoughts.
>>>
>>> On Tue, Jan 24, 2017 at 4:01 PM, shyla deshpande <
>>> deshpandesh...@gmail.com> wrote:
>>>
>>>>
>>>>
>>>> On Tue, Jan 24, 2017 at 9:44 AM, shyla deshpande <
>>>> deshpandesh...@gmail.com> wrote:
>>>>
>>>>> My streaming application stores lot of aggregations using
>>>>> mapWithState.
>>>>>
>>>>> I want to know what are all the possible ways I can make it
>>>>> idempotent.
>>>>>
>>>>> Please share your views.
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Mon, Jan 23, 2017 at 5:41 PM, shyla deshpande <
>>>>> deshpandesh...@gmail.com> wrote:
>>>>>
>>>>>> In a Wordcount application which  stores the count of all the words
>>>>>> input so far using mapWithState.  How do I make sure my counts are not
>>>>>> messed up if I happen to read a line more than once?
>>>>>>
>>>>>> Appreciate your response.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>


Re: How to make the state in a streaming application idempotent?

2017-01-25 Thread Burak Yavuz
Off the top of my head... (Each may have its own issues)

If upstream you add a uniqueId to all your records, then you may use a
BloomFilter to approximate if you've seen a row before.
The problem I can see with that approach is how to repopulate the bloom
filter on restarts.

If you are certain that you're not going to reprocess some data after a
certain time, i.e. there is no way I'm going to get the same data in 2
hours, it may only happen in the last 2 hours, then you may also keep the
state of uniqueId's as well, and then age them out after a certain time.


Best,
Burak

On Tue, Jan 24, 2017 at 9:53 PM, shyla deshpande 
wrote:

> Please share your thoughts.
>
> On Tue, Jan 24, 2017 at 4:01 PM, shyla deshpande  > wrote:
>
>>
>>
>> On Tue, Jan 24, 2017 at 9:44 AM, shyla deshpande <
>> deshpandesh...@gmail.com> wrote:
>>
>>> My streaming application stores lot of aggregations using mapWithState.
>>>
>>> I want to know what are all the possible ways I can make it idempotent.
>>>
>>> Please share your views.
>>>
>>> Thanks
>>>
>>> On Mon, Jan 23, 2017 at 5:41 PM, shyla deshpande <
>>> deshpandesh...@gmail.com> wrote:
>>>
 In a Wordcount application which  stores the count of all the words
 input so far using mapWithState.  How do I make sure my counts are not
 messed up if I happen to read a line more than once?

 Appreciate your response.

 Thanks

>>>
>>>
>>
>


[phpMyAdmin Git] [phpmyadmin/phpmyadmin] 5fbd21: Translated using Weblate (Turkish)

2017-01-25 Thread Burak Yavuz
  Branch: refs/heads/master
  Home:   https://github.com/phpmyadmin/phpmyadmin
  Commit: 5fbd21a0bd453f1c5393cc5f143a28bf36280e02
  
https://github.com/phpmyadmin/phpmyadmin/commit/5fbd21a0bd453f1c5393cc5f143a28bf36280e02
  Author: Burak Yavuz <hitowerdi...@hotmail.com>
  Date:   2017-01-25 (Wed, 25 Jan 2017)

  Changed paths:
M po/tr.po

  Log Message:
  ---
  Translated using Weblate (Turkish)

Currently translated at 100.0% (3199 of 3199 strings)

[CI skip]


___
Git mailing list
Git@phpmyadmin.net
https://lists.phpmyadmin.net/mailman/listinfo/git


Re: welcoming Burak and Holden as committers

2017-01-24 Thread Burak Yavuz
Thank you very much everyone! Hoping to help out the community as much as I
can!

Best,
Burak

On Tue, Jan 24, 2017 at 2:29 PM, Jacek Laskowski  wrote:

> Wow! At long last. Congrats Burak and Holden!
>
> p.s. I was a bit worried that the process of accepting new committers
> is equally hard as passing Sean's sanity checks for PRs, but given
> this it's so much easier it seems :D
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Tue, Jan 24, 2017 at 7:13 PM, Reynold Xin  wrote:
> > Hi all,
> >
> > Burak and Holden have recently been elected as Apache Spark committers.
> >
> > Burak has been very active in a large number of areas in Spark, including
> > linear algebra, stats/maths functions in DataFrames, Python/R APIs for
> > DataFrames, dstream, and most recently Structured Streaming.
> >
> > Holden has been a long time Spark contributor and evangelist. She has
> > written a few books on Spark, as well as frequent contributions to the
> > Python API to improve its usability and performance.
> >
> > Please join me in welcoming the two!
> >
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


[phpMyAdmin Git] [phpmyadmin/phpmyadmin] 60b77e: Translated using Weblate (Turkish)

2017-01-23 Thread Burak Yavuz
  Branch: refs/heads/master
  Home:   https://github.com/phpmyadmin/phpmyadmin
  Commit: 60b77eea6e73c67199a1f11e7d83c9c6212f08c3
  
https://github.com/phpmyadmin/phpmyadmin/commit/60b77eea6e73c67199a1f11e7d83c9c6212f08c3
  Author: Burak Yavuz <hitowerdi...@hotmail.com>
  Date:   2017-01-24 (Tue, 24 Jan 2017)

  Changed paths:
M po/tr.po

  Log Message:
  ---
  Translated using Weblate (Turkish)

Currently translated at 100.0% (3198 of 3198 strings)

[CI skip]


___
Git mailing list
Git@phpmyadmin.net
https://lists.phpmyadmin.net/mailman/listinfo/git


[phpMyAdmin Git] [phpmyadmin/phpmyadmin] d121c6: Translated using Weblate (Turkish)

2017-01-23 Thread Burak Yavuz
  Branch: refs/heads/QA_4_6
  Home:   https://github.com/phpmyadmin/phpmyadmin
  Commit: d121c692265078fb75b19e9b1f6eb49aae54c9ab
  
https://github.com/phpmyadmin/phpmyadmin/commit/d121c692265078fb75b19e9b1f6eb49aae54c9ab
  Author: Burak Yavuz <hitowerdi...@hotmail.com>
  Date:   2017-01-23 (Mon, 23 Jan 2017)

  Changed paths:
M po/tr.po

  Log Message:
  ---
  Translated using Weblate (Turkish)

Currently translated at 100.0% (3226 of 3226 strings)

[CI skip]


___
Git mailing list
Git@phpmyadmin.net
https://lists.phpmyadmin.net/mailman/listinfo/git


[phpMyAdmin Git] [phpmyadmin/localized_docs] 9e6df7: Translated using Weblate (Turkish)

2017-01-18 Thread Burak Yavuz
  Branch: refs/heads/master
  Home:   https://github.com/phpmyadmin/localized_docs
  Commit: 9e6df7a487803450187d60e21d3c8fe1341fca8b
  
https://github.com/phpmyadmin/localized_docs/commit/9e6df7a487803450187d60e21d3c8fe1341fca8b
  Author: Burak Yavuz <hitowerdi...@hotmail.com>
  Date:   2017-01-19 (Thu, 19 Jan 2017)

  Changed paths:
M po/tr.mo
M po/tr.po

  Log Message:
  ---
  Translated using Weblate (Turkish)

Currently translated at 100.0% (2343 of 2343 strings)

[CI skip]


___
Git mailing list
Git@phpmyadmin.net
https://lists.phpmyadmin.net/mailman/listinfo/git


Re: [SQL][SPARK-14160] Maximum interval for o.a.s.sql.functions.window

2017-01-18 Thread Burak Yavuz
Hi Maciej,

I believe it would be useful to either fix the documentation or fix the
implementation. I'll leave it to the community to comment on. The code
right now disallows intervals provided in months and years, because they
are not a "consistently" fixed amount of time. A month can be 28, 29, 30,
or 31 days. A year is 12 months for sure, but is it 360 days (sometimes
used in finance), 365 days or 366 days?

Therefore we could either:
  1) Allow windowing when intervals are given in days and less, even though
it could be 365 days, and fix the documentation.
  2) Explicitly disallow it as there may be a lot of data for a given
window, but partial aggregations should help with that.

My thoughts are to go with 1. What do you think?

Best,
Burak

On Wed, Jan 18, 2017 at 10:18 AM, Maciej Szymkiewicz  wrote:

> Hi,
>
> Can I ask for some clarifications regarding intended behavior of window /
> TimeWindow?
>
> PySpark documentation states that "Windows in the order of months are not
> supported". This is further confirmed by the checks in 
> TimeWindow.getIntervalInMicroseconds
> (https://git.io/vMP5l).
>
> Surprisingly enough we can pass interval much larger than a month by
> expressing interval in days or another unit of a higher precision. So this
> fails:
>
> Seq("2017-01-01").toDF("date").groupBy(window($"date", "1 month"))
>
> while following is accepted:
>
> Seq("2017-01-01").toDF("date").groupBy(window($"date", "999 days"))
>
> with results which look sensible at first glance.
>
> Is it a matter of a faulty validation logic (months will be assigned only
> if there is a match against years or months https://git.io/vMPdi) or
> expected behavior and I simply misunderstood the intentions?
>
> --
> Best,
> Maciej
>
>


[phpMyAdmin Git] [phpmyadmin/localized_docs] 8004d4: Translated using Weblate (Turkish)

2017-01-17 Thread Burak Yavuz
  Branch: refs/heads/master
  Home:   https://github.com/phpmyadmin/localized_docs
  Commit: 8004d4ceb72c7446a3dccae369a0912c8948b72c
  
https://github.com/phpmyadmin/localized_docs/commit/8004d4ceb72c7446a3dccae369a0912c8948b72c
  Author: Burak Yavuz <hitowerdi...@hotmail.com>
  Date:   2017-01-18 (Wed, 18 Jan 2017)

  Changed paths:
M po/tr.mo
M po/tr.po

  Log Message:
  ---
  Translated using Weblate (Turkish)

Currently translated at 100.0% (2334 of 2334 strings)

[CI skip]


___
Git mailing list
Git@phpmyadmin.net
https://lists.phpmyadmin.net/mailman/listinfo/git


[phpMyAdmin Git] [phpmyadmin/localized_docs] d580dc: Translated using Weblate (Turkish)

2017-01-11 Thread Burak Yavuz
  Branch: refs/heads/master
  Home:   https://github.com/phpmyadmin/localized_docs
  Commit: d580dc246ce2e159cf587637bc0d3cf5b06ad8b6
  
https://github.com/phpmyadmin/localized_docs/commit/d580dc246ce2e159cf587637bc0d3cf5b06ad8b6
  Author: Burak Yavuz <hitowerdi...@hotmail.com>
  Date:   2017-01-11 (Wed, 11 Jan 2017)

  Changed paths:
M po/tr.mo
M po/tr.po

  Log Message:
  ---
  Translated using Weblate (Turkish)

Currently translated at 100.0% (2331 of 2331 strings)

[CI skip]


___
Git mailing list
Git@phpmyadmin.net
https://lists.phpmyadmin.net/mailman/listinfo/git


[phpMyAdmin Git] [phpmyadmin/localized_docs] c5fa88: Translated using Weblate (Turkish)

2017-01-09 Thread Burak Yavuz
  Branch: refs/heads/master
  Home:   https://github.com/phpmyadmin/localized_docs
  Commit: c5fa88dc04a4a141da1be9af881a6f9b86575fdd
  
https://github.com/phpmyadmin/localized_docs/commit/c5fa88dc04a4a141da1be9af881a6f9b86575fdd
  Author: Burak Yavuz <hitowerdi...@hotmail.com>
  Date:   2017-01-09 (Mon, 09 Jan 2017)

  Changed paths:
M po/tr.mo
M po/tr.po

  Log Message:
  ---
  Translated using Weblate (Turkish)

Currently translated at 100.0% (2331 of 2331 strings)

[CI skip]


___
Git mailing list
Git@phpmyadmin.net
https://lists.phpmyadmin.net/mailman/listinfo/git


[phpMyAdmin Git] [phpmyadmin/localized_docs] c2fa72: Translated using Weblate (Turkish)

2016-12-23 Thread Burak Yavuz
  Branch: refs/heads/master
  Home:   https://github.com/phpmyadmin/localized_docs
  Commit: c2fa72dd06b4bd698b2d915b49c0745ef219f594
  
https://github.com/phpmyadmin/localized_docs/commit/c2fa72dd06b4bd698b2d915b49c0745ef219f594
  Author: Burak Yavuz <hitowerdi...@hotmail.com>
  Date:   2016-12-23 (Fri, 23 Dec 2016)

  Changed paths:
M po/tr.mo
M po/tr.po

  Log Message:
  ---
  Translated using Weblate (Turkish)

Currently translated at 100.0% (2328 of 2328 strings)

[CI skip]


___
Git mailing list
Git@phpmyadmin.net
https://lists.phpmyadmin.net/mailman/listinfo/git


[phpMyAdmin Git] [phpmyadmin/phpmyadmin] 6fe9e6: Translated using Weblate (Turkish)

2016-12-21 Thread Burak Yavuz
  Branch: refs/heads/master
  Home:   https://github.com/phpmyadmin/phpmyadmin
  Commit: 6fe9e68e762e257c7dd3ca3997fcef59e148789f
  
https://github.com/phpmyadmin/phpmyadmin/commit/6fe9e68e762e257c7dd3ca3997fcef59e148789f
  Author: Burak Yavuz <hitowerdi...@hotmail.com>
  Date:   2016-12-21 (Wed, 21 Dec 2016)

  Changed paths:
M po/tr.po

  Log Message:
  ---
  Translated using Weblate (Turkish)

Currently translated at 100.0% (3239 of 3239 strings)

[CI skip]


___
Git mailing list
Git@phpmyadmin.net
https://lists.phpmyadmin.net/mailman/listinfo/git


[jira] [Updated] (SPARK-18952) regex strings not properly escaped in codegen for aggregations

2016-12-20 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz updated SPARK-18952:

Summary: regex strings not properly escaped in codegen for aggregations  
(was: regex strings not properly escaped in codegen)

> regex strings not properly escaped in codegen for aggregations
> --
>
> Key: SPARK-18952
> URL: https://issues.apache.org/jira/browse/SPARK-18952
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>    Reporter: Burak Yavuz
>
> If I use the function regexp_extract, and then in my regex string, use `\`, 
> i.e. escape character, this fails codegen, because the `\` character is not 
> properly escaped when codegen'd.
> Example stack trace:
> {code}
> /* 059 */ private int maxSteps = 2;
> /* 060 */ private int numRows = 0;
> /* 061 */ private org.apache.spark.sql.types.StructType keySchema = new 
> org.apache.spark.sql.types.StructType().add("date_format(window#325.start, 
> -MM-dd HH:mm)", org.apache.spark.sql.types.DataTypes.StringType)
> /* 062 */ .add("regexp_extract(source#310.description, ([a-zA-Z]+)\[.*, 
> 1)", org.apache.spark.sql.types.DataTypes.StringType);
> /* 063 */ private org.apache.spark.sql.types.StructType valueSchema = new 
> org.apache.spark.sql.types.StructType().add("sum", 
> org.apache.spark.sql.types.DataTypes.LongType);
> /* 064 */ private Object emptyVBase;
> ...
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 62, Column 58: Invalid escape sequence
>   at org.codehaus.janino.Scanner.scanLiteralCharacter(Scanner.java:918)
>   at org.codehaus.janino.Scanner.produce(Scanner.java:604)
>   at org.codehaus.janino.Parser.peekRead(Parser.java:3239)
>   at org.codehaus.janino.Parser.parseArguments(Parser.java:3055)
>   at org.codehaus.janino.Parser.parseSelector(Parser.java:2914)
>   at org.codehaus.janino.Parser.parseUnaryExpression(Parser.java:2617)
>   at 
> org.codehaus.janino.Parser.parseMultiplicativeExpression(Parser.java:2573)
>   at org.codehaus.janino.Parser.parseAdditiveExpression(Parser.java:2552)
> {code}
> In the codegen'd expression, the literal should use `\\` instead of `\`



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18952) regex strings not properly escaped in codegen

2016-12-20 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-18952:
---

 Summary: regex strings not properly escaped in codegen
 Key: SPARK-18952
 URL: https://issues.apache.org/jira/browse/SPARK-18952
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.2
Reporter: Burak Yavuz


If I use the function regexp_extract and my regex string contains `\`, i.e. the 
escape character, codegen fails because the `\` character is not properly 
escaped in the generated code.

Example stack trace:
{code}
/* 059 */ private int maxSteps = 2;
/* 060 */ private int numRows = 0;
/* 061 */ private org.apache.spark.sql.types.StructType keySchema = new 
org.apache.spark.sql.types.StructType().add("date_format(window#325.start, 
-MM-dd HH:mm)", org.apache.spark.sql.types.DataTypes.StringType)
/* 062 */ .add("regexp_extract(source#310.description, ([a-zA-Z]+)\[.*, 
1)", org.apache.spark.sql.types.DataTypes.StringType);
/* 063 */ private org.apache.spark.sql.types.StructType valueSchema = new 
org.apache.spark.sql.types.StructType().add("sum", 
org.apache.spark.sql.types.DataTypes.LongType);
/* 064 */ private Object emptyVBase;

...

org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 62, 
Column 58: Invalid escape sequence
at org.codehaus.janino.Scanner.scanLiteralCharacter(Scanner.java:918)
at org.codehaus.janino.Scanner.produce(Scanner.java:604)
at org.codehaus.janino.Parser.peekRead(Parser.java:3239)
at org.codehaus.janino.Parser.parseArguments(Parser.java:3055)
at org.codehaus.janino.Parser.parseSelector(Parser.java:2914)
at org.codehaus.janino.Parser.parseUnaryExpression(Parser.java:2617)
at 
org.codehaus.janino.Parser.parseMultiplicativeExpression(Parser.java:2573)
at org.codehaus.janino.Parser.parseAdditiveExpression(Parser.java:2552)
{code}

In the code-generated expression, the literal should use `\\` instead of `\`.
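
A minimal sketch of the kind of query that hits this (PySpark; the data, column 
names, and pattern are illustrative). The point is that the un-aliased grouping 
key's name embeds a pattern containing `\[`, which then appears unescaped in the 
generated Java source:
{code}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("service[42] started", 1), ("worker[7] stopped", 2)],
    ["description", "value"])

# Grouping on the un-aliased regexp_extract keeps the pattern text in the
# generated column name, which ends up as an invalid `\[` escape in codegen.
agg = (df
    .groupBy(F.regexp_extract("description", r"([a-zA-Z]+)\[.*", 1))
    .agg(F.sum("value")))

agg.show()  # fails in 2.0.2 with CompileException: Invalid escape sequence
{code}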




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18927) MemorySink for StructuredStreaming can't recover from checkpoint if location is provided in conf

2016-12-19 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-18927:
---

 Summary: MemorySink for StructuredStreaming can't recover from 
checkpoint if location is provided in conf
 Key: SPARK-18927
 URL: https://issues.apache.org/jira/browse/SPARK-18927
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.1.0
Reporter: Burak Yavuz






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18900) Flaky Test: StateStoreSuite.maintenance

2016-12-16 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-18900:
---

 Summary: Flaky Test: StateStoreSuite.maintenance
 Key: SPARK-18900
 URL: https://issues.apache.org/jira/browse/SPARK-18900
 Project: Spark
  Issue Type: Test
  Components: Structured Streaming
Affects Versions: 2.0.2
Reporter: Burak Yavuz


https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70223/testReport/




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18888) partitionBy in DataStreamWriter in Python throws _to_seq not defined

2016-12-15 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz updated SPARK-18888:

Affects Version/s: (was: 2.1.0)
   2.0.2

> partitionBy in DataStreamWriter in Python throws _to_seq not defined
> 
>
> Key: SPARK-18888
> URL: https://issues.apache.org/jira/browse/SPARK-18888
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Structured Streaming
>Affects Versions: 2.0.2
>    Reporter: Burak Yavuz
>Priority: Blocker
>
> {code}
> python/pyspark/sql/streaming.py in partitionBy(self, *cols)
> 716 if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
> 717 cols = cols[0]
> --> 718 self._jwrite = 
> self._jwrite.partitionBy(_to_seq(self._spark._sc, cols))
> 719 return self
> 720 
> NameError: global name '_to_seq' is not defined
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18888) partitionBy in DataStreamWriter in Python throws _to_seq not defined

2016-12-15 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-18888:
---

 Summary: partitionBy in DataStreamWriter in Python throws _to_seq 
not defined
 Key: SPARK-18888
 URL: https://issues.apache.org/jira/browse/SPARK-18888
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Structured Streaming
Affects Versions: 2.1.0
Reporter: Burak Yavuz
Priority: Blocker


{code}
python/pyspark/sql/streaming.py in partitionBy(self, *cols)
716 if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
717 cols = cols[0]
--> 718 self._jwrite = 
self._jwrite.partitionBy(_to_seq(self._spark._sc, cols))
719 return self
720 

NameError: global name '_to_seq' is not defined
{code}
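
A minimal repro sketch from user code (the schema and paths below are 
illustrative); any partitionBy call on a streaming writer goes through the line 
shown in the traceback above:
{code}
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

schema = StructType([StructField("date", StringType()),
                     StructField("event", StringType())])
events = spark.readStream.schema(schema).json("/tmp/events-in")

(events.writeStream
    .format("parquet")
    .option("checkpointLocation", "/tmp/events-chk")
    .partitionBy("date")   # raises NameError: global name '_to_seq' is not defined
    .start("/tmp/events-out"))
{code}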



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18868) Flaky Test: StreamingQueryListenerSuite

2016-12-14 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-18868:
---

 Summary: Flaky Test: StreamingQueryListenerSuite
 Key: SPARK-18868
 URL: https://issues.apache.org/jira/browse/SPARK-18868
 Project: Spark
  Issue Type: Test
  Components: Structured Streaming
Reporter: Burak Yavuz


Example: 
https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3496/consoleFull



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[phpMyAdmin Git] [phpmyadmin/localized_docs] 917856: Translated using Weblate (Turkish)

2016-12-14 Thread Burak Yavuz
  Branch: refs/heads/master
  Home:   https://github.com/phpmyadmin/localized_docs
  Commit: 917856aa5d2fde4a7be01719d802a920df96cb4a
  
https://github.com/phpmyadmin/localized_docs/commit/917856aa5d2fde4a7be01719d802a920df96cb4a
  Author: Burak Yavuz <hitowerdi...@hotmail.com>
  Date:   2016-12-14 (Wed, 14 Dec 2016)

  Changed paths:
M po/tr.mo
M po/tr.po

  Log Message:
  ---
  Translated using Weblate (Turkish)

Currently translated at 100.0% (2321 of 2321 strings)

[CI skip]


___
Git mailing list
Git@phpmyadmin.net
https://lists.phpmyadmin.net/mailman/listinfo/git


[phpMyAdmin Git] [phpmyadmin/phpmyadmin] e73a59: Translated using Weblate (Turkish)

2016-12-13 Thread Burak Yavuz
  Branch: refs/heads/master
  Home:   https://github.com/phpmyadmin/phpmyadmin
  Commit: e73a59f18e949c7ee8ea2e15bbd95cfe785db674
  
https://github.com/phpmyadmin/phpmyadmin/commit/e73a59f18e949c7ee8ea2e15bbd95cfe785db674
  Author: Burak Yavuz <hitowerdi...@hotmail.com>
  Date:   2016-12-13 (Tue, 13 Dec 2016)

  Changed paths:
M po/tr.po

  Log Message:
  ---
  Translated using Weblate (Turkish)

Currently translated at 100.0% (3239 of 3239 strings)

[CI skip]


___
Git mailing list
Git@phpmyadmin.net
https://lists.phpmyadmin.net/mailman/listinfo/git


[phpMyAdmin Git] [phpmyadmin/phpmyadmin] 2e561a: Translated using Weblate (Turkish)

2016-12-13 Thread Burak Yavuz
  Branch: refs/heads/QA_4_6
  Home:   https://github.com/phpmyadmin/phpmyadmin
  Commit: 2e561a296e29df9c2f57d959a7b8c519921dcd25
  
https://github.com/phpmyadmin/phpmyadmin/commit/2e561a296e29df9c2f57d959a7b8c519921dcd25
  Author: Burak Yavuz <hitowerdi...@hotmail.com>
  Date:   2016-12-13 (Tue, 13 Dec 2016)

  Changed paths:
M po/tr.po

  Log Message:
  ---
  Translated using Weblate (Turkish)

Currently translated at 100.0% (3222 of 3222 strings)

[CI skip]


___
Git mailing list
Git@phpmyadmin.net
https://lists.phpmyadmin.net/mailman/listinfo/git


[jira] [Created] (SPARK-18811) Stream Source resolution should happen in StreamExecution thread, not main thread

2016-12-09 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-18811:
---

 Summary: Stream Source resolution should happen in StreamExecution 
thread, not main thread
 Key: SPARK-18811
 URL: https://issues.apache.org/jira/browse/SPARK-18811
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.0.2, 2.1.0
Reporter: Burak Yavuz


When you start a stream, resolving the source of the stream, for example 
resolving partition columns, could take a long time. This long execution time 
should not block the main thread on which `query.start()` was called. The 
resolution should happen in the stream execution thread, possibly before 
starting any triggers.
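
For illustration, a minimal sketch of the user-visible symptom (paths and 
schema are illustrative):
{code}
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.getOrCreate()

schema = StructType([StructField("id", LongType()),
                     StructField("payload", StringType())])
# A heavily partitioned path, so partition discovery is expensive.
events = spark.readStream.schema(schema).parquet("/data/events")

query = (events.writeStream
    .format("parquet")
    .option("checkpointLocation", "/data/events-chk")
    .start("/data/events-out"))
# start() currently does not return until the source (including its partition
# columns) has been resolved on the calling thread; the proposal is to do that
# work on the stream execution thread instead.
{code}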



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Spark Streaming - join streaming and static data

2016-12-06 Thread Burak Yavuz
Hi Daniela,

This is trivial with Structured Streaming. If your Kafka cluster is 0.10.0
or above, you may use Spark 2.0.2 to create a Streaming DataFrame from
Kafka, and then also create a DataFrame using the JDBC connection, and you
may join those. In Spark 2.1, there's support for a function called
"from_json", which should also help you easily parse your messages incoming
from Kafka.

Best,
Burak
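
A minimal sketch of that approach (Spark 2.1 for from_json; the broker address,
topic, JDBC URL, and table name below are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

# Schema of the JSON messages on the Kafka topic.
msg_schema = StructType([StructField("timestamp", StringType()),
                         StructField("ID", IntegerType())])

# Streaming DataFrame from Kafka.
events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
    .select(from_json(col("value").cast("string"), msg_schema).alias("m"))
    .select(col("m.ID").alias("ID"),
            col("m.timestamp").cast("timestamp").alias("timestamp")))

# Static DataFrame with the per-ID minute/value rows, read over JDBC.
values = (spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://dbhost:5432/mydb")
    .option("dbtable", "id_values")   # columns: ID, minute, value
    .load())

# Stream-static join, then shift each timestamp by `minute` minutes.
result = (events.join(values, "ID")
    .select(((col("timestamp").cast("long") + col("minute") * 60)
             .cast("timestamp")).alias("timestamp"),
            col("value")))

query = result.writeStream.format("console").start()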

On Tue, Dec 6, 2016 at 2:16 AM, Daniela S  wrote:

> Hi
>
> I have some questions regarding Spark Streaming.
>
> I receive a stream of JSON messages from Kafka.
> The messages consist of a timestamp and an ID.
>
> timestamp            ID
> 2016-12-06 13:00     1
> 2016-12-06 13:40     5
> ...
>
> In a database I have values for each ID:
>
> ID   minute   value
> 1    0        3
> 1    1        5
> 1    2        7
> 1    3        8
> 5    0        6
> 5    1        6
> 5    2        8
> 5    3        5
> 5    4        6
>
> So I would like to join each incoming JSON message with the corresponding
> values. It should look as follows:
>
> timestamp            ID   minute   value
> 2016-12-06 13:00     1    0        3
> 2016-12-06 13:00     1    1        5
> 2016-12-06 13:00     1    2        7
> 2016-12-06 13:00     1    3        8
> 2016-12-06 13:40     5    0        6
> 2016-12-06 13:40     5    1        6
> 2016-12-06 13:40     5    2        8
> 2016-12-06 13:40     5    3        5
> 2016-12-06 13:40     5    4        6
> ...
>
> Then I would like to add the minute values to the timestamp. I only need
> the computed timestamp and the values. So the result should look as follows:
>
> timestamp   value
> 2016-12-06 13:00  3
> 2016-12-06 13:01  5
> 2016-12-06 13:02  7
> 2016-12-06 13:03  8
> 2016-12-06 13:40  6
> 2016-12-06 13:41  6
> 2016-12-06 13:42  8
> 2016-12-06 13:43  5
> 2016-12-06 13:44  6
> ...
>
> Is this a possible use case for Spark Streaming? I thought I could join
> the streaming data with the static data but I am not sure how to add the
> minute values to the timestamp. Is this possible with Spark Streaming?
>
> Thank you in advance.
>
> Best regards,
> Daniela
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org


[jira] [Commented] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source

2016-11-29 Thread Burak Yavuz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706788#comment-15706788
 ] 

Burak Yavuz commented on SPARK-18475:
-

I'd be happy to share performance results. You're right, I never tried it with 
SSL on. One thing to note is that I was never planning to have this enabled by 
default, because there is no way to think of a sane default parallelism value.

What I was hoping to achieve was to provide Spark users, who may not be Kafka 
experts, a "Break in case of emergency" way out. It's easy to say "Partition 
your data properly" to people, until someone upstream in your organization 
changes one thing and the data engineer has to deal with the mess of skewed 
data.

You may want to tell people, "hey, increase your Kafka partitions" if you want 
to increase Kafka parallelism, but is that a viable operation when your queues 
are already messed up and the damage has already been done? Are you going to 
have them empty the queue, delete the topic, create a topic with an increased 
number of partitions, and re-consume everything so that it is properly 
partitioned again?

It's easy to talk about what needs to be done, and what is the proper way to do 
things until shit hits the fan in production with something that is/was totally 
out of your control and you have to clean up the mess.

> Be able to provide higher parallelization for StructuredStreaming Kafka Source
> --
>
> Key: SPARK-18475
> URL: https://issues.apache.org/jira/browse/SPARK-18475
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>    Affects Versions: 2.0.2, 2.1.0
>Reporter: Burak Yavuz
>
> Right now the StructuredStreaming Kafka Source creates as many Spark tasks as 
> there are TopicPartitions that we're going to read from Kafka.
> This doesn't work well when we have data skew, and there is no reason why we 
> shouldn't be able to increase parallelism further, i.e. have multiple Spark 
> tasks reading from the same Kafka TopicPartition.
> What this will mean is that we won't be able to use the "CachedKafkaConsumer" 
> for what it is defined for (being cached) in this use case, but the extra 
> overhead is worth handling data skew and increasing parallelism especially in 
> ETL use cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18634) Corruption and Correctness issues with exploding Python UDFs

2016-11-29 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz updated SPARK-18634:

Description: 
There are some weird issues with exploding Python UDFs in SparkSQL.

There are 2 cases where based on the DataType of the exploded column, the 
result can be flat out wrong, or corrupt. Seems like something bad is happening 
when telling Tungsten the schema of the rows during or after applying the UDF.

Please check the code below for reproduction.

Notebook: 
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6186780348633019/3425836135165635/4343791953238323/latest.html


  was:
There are some weird issues with exploding Python UDFs in SparkSQL.

There are 2 cases where based on the DataType of the exploded column, the 
result can be flat out wrong, or corrupt. Seems like something bad is happening 
when telling Tungsten the schema of the rows during or after applying the UDF.

Please check the code below for reproduction.

Notebook: 
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6186780348633019/3425836135165635/4343791953238323/latest.html
{code}
> 
from pyspark.sql.functions import *
from pyspark.sql.types import *
> 
df = spark.range(10)
Issue #1
Corrupt Data
> 
def return_range(value):
  return [(i, str(i)) for i in range(value - 1, value + 1)]

range_udf = udf(return_range, ArrayType(StructType([StructField("integer_val", 
IntegerType()),
StructField("string_val", 
StringType())])))
> 
df.select("id", explode(range_udf(df.id))).show()
+---++
| id| col|
+---++
|  0|[0,null]|
|  0|[0,null]|
|  1|[0,null]|
|  1|[0,null]|
|  2|[0,null]|
|  2|[0,null]|
|  3|[0,null]|
|  3|[0,null]|
|  4|[0,null]|
|  4|[0,null]|
|  5|[0,null]|
|  5|[0,null]|
|  6|[0,null]|
|  6|[0,null]|
|  7|[0,null]|
|  7|[0,null]|
|  8|[0,null]|
|  8|[0,null]|
|  9|[0,null]|
|  9|[0,null]|
+---++

Fine if I do the explode in the second step...
> 
df.select("id", range_udf(df.id).alias("range")).select("id", 
explode("range")).show()
+---+---+
| id|col|
+---+---+
|  0|[-1,-1]|
|  0|  [0,0]|
|  1|  [0,0]|
|  1|  [1,1]|
|  2|  [1,1]|
|  2|  [2,2]|
|  3|  [2,2]|
|  3|  [3,3]|
|  4|  [3,3]|
|  4|  [4,4]|
|  5|  [4,4]|
|  5|  [5,5]|
|  6|  [5,5]|
|  6|  [6,6]|
|  7|  [6,6]|
|  7|  [7,7]|
|  8|  [7,7]|
|  8|  [8,8]|
|  9|  [8,8]|
|  9|  [9,9]|
+---+---+

... or if I don't include the second column
> 
df.select(explode(range_udf(df.id))).show()
+---+
|col|
+---+
|[-1,-1]|
|  [0,0]|
|  [0,0]|
|  [1,1]|
|  [1,1]|
|  [2,2]|
|  [2,2]|
|  [3,3]|
|  [3,3]|
|  [4,4]|
|  [4,4]|
|  [5,5]|
|  [5,5]|
|  [6,6]|
|  [6,6]|
|  [7,7]|
|  [7,7]|
|  [8,8]|
|  [8,8]|
|  [9,9]|
+---+

Issue #2 Flat out wrong answer
> 
def return_range2(value):
  return range(value - 1, value + 1)

range_udf2 = udf(return_range2, ArrayType(IntegerType()))
> 
df.select("id", explode(range_udf2(df.id))).show()
+---+---+
| id|col|
+---+---+
|  0| 24|
|  0| 24|
|  1| 24|
|  1| 24|
|  2| 24|
|  2| 24|
|  3| 24|
|  3| 24|
|  4| 24|
|  4| 24|
|  5| 24|
|  5| 24|
|  6| 24|
|  6| 24|
|  7| 24|
|  7| 24|
|  8| 24|
|  8| 24|
|  9| 24|
|  9| 24|
+---+---+

> 
df.select("id", range_udf2(df.id).alias("range")).select("id", 
explode("range")).show()
+---+---+
| id|col|
+---+---+
|  0| -1|
|  0|  0|
|  1|  0|
|  1|  1|
|  2|  1|
|  2|  2|
|  3|  2|
|  3|  3|
|  4|  3|
|  4|  4|
|  5|  4|
|  5|  5|
|  6|  5|
|  6|  6|
|  7|  6|
|  7|  7|
|  8|  7|
|  8|  8|
|  9|  8|
|  9|  9|
+---+---+

> 
df.select(explode(range_udf2(df.id))).show()
+---+
|col|
+---+
| -1|
|  0|
|  0|
|  1|
|  1|
|  2|
|  2|
|  3|
|  3|
|  4|
|  4|
|  5|
|  5|
|  6|
|  6|
|  7|
|  7|
|  8|
|  8|
|  9|
+---+

{code}


> Corruption and Correctness issues with exploding Python UDFs
> 
>
> Key: SPARK-18634
> URL: https://issues.apache.org/jira/browse/SPARK-18634
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Burak Yavuz
>
> There are some weird issues with exploding Python UDFs in SparkSQL.
> There are 2 cases where based on the DataType of the exploded column, the 
> result can be flat out wrong, or corrupt. Seems like something bad is 
> happening when telling Tungsten the schema of the rows during or after 
> applying the UDF.
> Please check the code below for reproduction.
> Notebook: 
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6186780348633019/342583613

[jira] [Updated] (SPARK-18634) Corruption and Correctness issues with exploding Python UDFs

2016-11-29 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz updated SPARK-18634:

Description: 
There are some weird issues with exploding Python UDFs in SparkSQL.

There are 2 cases where based on the DataType of the exploded column, the 
result can be flat out wrong, or corrupt. Seems like something bad is happening 
when telling Tungsten the schema of the rows during or after applying the UDF.

Please check the code below for reproduction.

Notebook: 
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6186780348633019/3425836135165635/4343791953238323/latest.html
{code}
> 
from pyspark.sql.functions import *
from pyspark.sql.types import *
> 
df = spark.range(10)
Issue #1
Corrupt Data
> 
def return_range(value):
  return [(i, str(i)) for i in range(value - 1, value + 1)]

range_udf = udf(return_range, ArrayType(StructType([StructField("integer_val", 
IntegerType()),
StructField("string_val", 
StringType())])))
> 
df.select("id", explode(range_udf(df.id))).show()
+---++
| id| col|
+---++
|  0|[0,null]|
|  0|[0,null]|
|  1|[0,null]|
|  1|[0,null]|
|  2|[0,null]|
|  2|[0,null]|
|  3|[0,null]|
|  3|[0,null]|
|  4|[0,null]|
|  4|[0,null]|
|  5|[0,null]|
|  5|[0,null]|
|  6|[0,null]|
|  6|[0,null]|
|  7|[0,null]|
|  7|[0,null]|
|  8|[0,null]|
|  8|[0,null]|
|  9|[0,null]|
|  9|[0,null]|
+---++

Fine if I do the explode in the second step...
> 
df.select("id", range_udf(df.id).alias("range")).select("id", 
explode("range")).show()
+---+---+
| id|col|
+---+---+
|  0|[-1,-1]|
|  0|  [0,0]|
|  1|  [0,0]|
|  1|  [1,1]|
|  2|  [1,1]|
|  2|  [2,2]|
|  3|  [2,2]|
|  3|  [3,3]|
|  4|  [3,3]|
|  4|  [4,4]|
|  5|  [4,4]|
|  5|  [5,5]|
|  6|  [5,5]|
|  6|  [6,6]|
|  7|  [6,6]|
|  7|  [7,7]|
|  8|  [7,7]|
|  8|  [8,8]|
|  9|  [8,8]|
|  9|  [9,9]|
+---+---+

... or if I don't include the second column
> 
df.select(explode(range_udf(df.id))).show()
+---+
|col|
+---+
|[-1,-1]|
|  [0,0]|
|  [0,0]|
|  [1,1]|
|  [1,1]|
|  [2,2]|
|  [2,2]|
|  [3,3]|
|  [3,3]|
|  [4,4]|
|  [4,4]|
|  [5,5]|
|  [5,5]|
|  [6,6]|
|  [6,6]|
|  [7,7]|
|  [7,7]|
|  [8,8]|
|  [8,8]|
|  [9,9]|
+---+

Issue #2 Flat out wrong answer
> 
def return_range2(value):
  return range(value - 1, value + 1)

range_udf2 = udf(return_range2, ArrayType(IntegerType()))
> 
df.select("id", explode(range_udf2(df.id))).show()
+---+---+
| id|col|
+---+---+
|  0| 24|
|  0| 24|
|  1| 24|
|  1| 24|
|  2| 24|
|  2| 24|
|  3| 24|
|  3| 24|
|  4| 24|
|  4| 24|
|  5| 24|
|  5| 24|
|  6| 24|
|  6| 24|
|  7| 24|
|  7| 24|
|  8| 24|
|  8| 24|
|  9| 24|
|  9| 24|
+---+---+

> 
df.select("id", range_udf2(df.id).alias("range")).select("id", 
explode("range")).show()
+---+---+
| id|col|
+---+---+
|  0| -1|
|  0|  0|
|  1|  0|
|  1|  1|
|  2|  1|
|  2|  2|
|  3|  2|
|  3|  3|
|  4|  3|
|  4|  4|
|  5|  4|
|  5|  5|
|  6|  5|
|  6|  6|
|  7|  6|
|  7|  7|
|  8|  7|
|  8|  8|
|  9|  8|
|  9|  9|
+---+---+

> 
df.select(explode(range_udf2(df.id))).show()
+---+
|col|
+---+
| -1|
|  0|
|  0|
|  1|
|  1|
|  2|
|  2|
|  3|
|  3|
|  4|
|  4|
|  5|
|  5|
|  6|
|  6|
|  7|
|  7|
|  8|
|  8|
|  9|
+---+

{code}

  was:
There are some weird issues with exploding Python UDFs in SparkSQL.

There are 2 cases where based on the DataType of the exploded column, the 
result can be flat out wrong, or corrupt. Seems like something bad is happening 
when telling Tungsten the schema of the rows during or after applying the UDF.

Please check the attached notebook for reproduction.




> Corruption and Correctness issues with exploding Python UDFs
> 
>
> Key: SPARK-18634
> URL: https://issues.apache.org/jira/browse/SPARK-18634
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Burak Yavuz
>
> There are some weird issues with exploding Python UDFs in SparkSQL.
> There are 2 cases where based on the DataType of the exploded column, the 
> result can be flat out wrong, or corrupt. Seems like something bad is 
> happening when telling Tungsten the schema of the rows during or after 
> applying the UDF.
> Please check the code below for reproduction.
> Notebook: 
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6186780348633019/3425836135165635/4343791953238323/latest.html
> {code}
> > 
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
> > 
> df = 

[jira] [Updated] (SPARK-18634) Corruption and Correctness issues with exploding Python UDFs

2016-11-29 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz updated SPARK-18634:

Summary: Corruption and Correctness issues with exploding Python UDFs  
(was: Issues with exploding Python UDFs)

> Corruption and Correctness issues with exploding Python UDFs
> 
>
> Key: SPARK-18634
> URL: https://issues.apache.org/jira/browse/SPARK-18634
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.1.0
>    Reporter: Burak Yavuz
>
> There are some weird issues with exploding Python UDFs in SparkSQL.
> There are 2 cases where based on the DataType of the exploded column, the 
> result can be flat out wrong, or corrupt. Seems like something bad is 
> happening when telling Tungsten the schema of the rows during or after 
> applying the UDF.
> Please check the attached notebook for reproduction.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18634) Corruption and Correctness issues with exploding Python UDFs

2016-11-29 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz updated SPARK-18634:

Description: 
There are some weird issues with exploding Python UDFs in SparkSQL.

There are 2 cases where based on the DataType of the exploded column, the 
result can be flat out wrong, or corrupt. Seems like something bad is happening 
when telling Tungsten the schema of the rows during or after applying the UDF.

Please check the attached notebook for reproduction.



  was:
There are some weird issues with exploding Python UDFs in SparkSQL.

There are 2 cases where based on the DataType of the exploded column, the 
result can be flat out wrong, or corrupt. Seems like something bad is happening 
when telling Tungsten the schema of the rows during or after applying the UDF.

Please check the attached notebook for reproduction.


> Corruption and Correctness issues with exploding Python UDFs
> 
>
> Key: SPARK-18634
> URL: https://issues.apache.org/jira/browse/SPARK-18634
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.1.0
>    Reporter: Burak Yavuz
>
> There are some weird issues with exploding Python UDFs in SparkSQL.
> There are 2 cases where based on the DataType of the exploded column, the 
> result can be flat out wrong, or corrupt. Seems like something bad is 
> happening when telling Tungsten the schema of the rows during or after 
> applying the UDF.
> Please check the attached notebook for reproduction.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18634) Issues with exploding Python UDFs

2016-11-29 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-18634:
---

 Summary: Issues with exploding Python UDFs
 Key: SPARK-18634
 URL: https://issues.apache.org/jira/browse/SPARK-18634
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 2.0.2, 2.1.0
Reporter: Burak Yavuz


There are some weird issues with exploding Python UDFs in SparkSQL.

There are 2 cases where based on the DataType of the exploded column, the 
result can be flat out wrong, or corrupt. Seems like something bad is happening 
when telling Tungsten the schema of the rows during or after applying the UDF.

Please check the attached notebook for reproduction.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18407) Inferred partition columns cause assertion error

2016-11-25 Thread Burak Yavuz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15696337#comment-15696337
 ] 

Burak Yavuz commented on SPARK-18407:
-

This is also resolved as part of 
https://issues.apache.org/jira/browse/SPARK-18510

> Inferred partition columns cause assertion error
> 
>
> Key: SPARK-18407
> URL: https://issues.apache.org/jira/browse/SPARK-18407
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2
>Reporter: Michael Armbrust
>Priority: Critical
>
> [This 
> assertion|https://github.com/apache/spark/blob/16eaad9daed0b633e6a714b5704509aa7107d6e5/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L408]
>  fails when you run a stream against json data that is stored in partitioned 
> folders, if you manually specify the schema and that schema omits the 
> partitioned columns.
> My hunch is that we are inferring those columns even though the schema is 
> being passed in manually and adding them to the end.
> While we are fixing this bug, it would be nice to make the assertion better.  
> Truncating is not terribly useful as, at least in my case, it truncated the 
> most interesting part.  I changed it to this while debugging:
> {code}
>   s"""
>  |Batch does not have expected schema
>  |Expected: ${output.mkString(",")}
>  |Actual: ${newPlan.output.mkString(",")}
>  |
>  |== Original ==
>  |$logicalPlan
>  |
>  |== Batch ==
>  |$newPlan
>""".stripMargin
> {code}
> I also tried specifying the partition columns in the schema and now it 
> appears that they are filled with corrupted data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18510) Partition schema inference corrupts data

2016-11-20 Thread Burak Yavuz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15681677#comment-15681677
 ] 

Burak Yavuz commented on SPARK-18510:
-

No. Working on a separate fix

> Partition schema inference corrupts data
> 
>
> Key: SPARK-18510
> URL: https://issues.apache.org/jira/browse/SPARK-18510
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.1.0
>    Reporter: Burak Yavuz
>Priority: Blocker
>
> Not sure if this is a regression from 2.0 to 2.1. I was investigating this 
> for Structured Streaming, but it seems it affects batch data as well.
> Here's the issue:
> If I specify my schema when doing
> {code}
> spark.read
>   .schema(someSchemaWherePartitionColumnsAreStrings)
> {code}
> but if the partition inference can infer it as IntegerType or I assume 
> LongType or DoubleType (basically fixed size types), then once UnsafeRows are 
> generated, your data will be corrupted.
> Reproduction:
> {code}
> val createArray = udf { (length: Long) =>
> for (i <- 1 to length.toInt) yield i.toString
> }
> spark.range(10).select(createArray('id + 1) as 'ex, 'id, 'id % 4 as 
> 'part).coalesce(1).write
> .partitionBy("part", "id")
> .mode("overwrite")
> .parquet(src.toString)
> val schema = new StructType()
> .add("id", StringType)
> .add("part", IntegerType)
> .add("ex", ArrayType(StringType))
> spark.read
>   .schema(schema)
>   .format("parquet")
>   .load(src.toString)
>   .show()
> {code}
> The UDF is useful for creating a row long enough so that you don't hit other 
> weird NullPointerExceptions caused for the same reason I believe.
> Output:
> {code}
> +-+++
> |   id|part|  ex|
> +-+++
> |�|   1|[1, 2, 3, 4, 5, 6...|
> | |   0|[1, 2, 3, 4, 5, 6...|
> |  |   3|[1, 2, 3, 4, 5, 6...|
> |   |   2|[1, 2, 3, 4, 5, 6...|
> ||   1|  [1, 2, 3, 4, 5, 6]|
> | |   0| [1, 2, 3, 4, 5]|
> |  |   3|[1, 2, 3, 4]|
> |   |   2|   [1, 2, 3]|
> ||   1|  [1, 2]|
> | |   0| [1]|
> +-+++
> {code}
> I was hoping to fix the issue as part of SPARK-18407 but it seems it's not 
> only applicable to StructuredStreaming and deserves its own JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18510) Partition schema inference corrupts data

2016-11-19 Thread Burak Yavuz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15680328#comment-15680328
 ] 

Burak Yavuz commented on SPARK-18510:
-

cc [~r...@databricks.com] I marked this as a blocker as it is pretty nasty. 
Feel free to downgrade if you don't think so, or feel like the semantics of 
what I'm doing is wrong.

> Partition schema inference corrupts data
> 
>
> Key: SPARK-18510
> URL: https://issues.apache.org/jira/browse/SPARK-18510
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.1.0
>    Reporter: Burak Yavuz
>Priority: Blocker
>
> Not sure if this is a regression from 2.0 to 2.1. I was investigating this 
> for Structured Streaming, but it seems it affects batch data as well.
> Here's the issue:
> If I specify my schema when doing
> {code}
> spark.read
>   .schema(someSchemaWherePartitionColumnsAreStrings)
> {code}
> but if the partition inference can infer it as IntegerType or I assume 
> LongType or DoubleType (basically fixed size types), then once UnsafeRows are 
> generated, your data will be corrupted.
> Reproduction:
> {code}
> val createArray = udf { (length: Long) =>
> for (i <- 1 to length.toInt) yield i.toString
> }
> spark.range(10).select(createArray('id + 1) as 'ex, 'id, 'id % 4 as 
> 'part).coalesce(1).write
> .partitionBy("part", "id")
> .mode("overwrite")
> .parquet(src.toString)
> val schema = new StructType()
> .add("id", StringType)
> .add("part", IntegerType)
> .add("ex", ArrayType(StringType))
> spark.read
>   .schema(schema)
>   .format("parquet")
>   .load(src.toString)
>   .show()
> {code}
> The UDF is useful for creating a row long enough so that you don't hit other 
> weird NullPointerExceptions caused for the same reason I believe.
> Output:
> {code}
> +-+++
> |   id|part|  ex|
> +-+++
> |�|   1|[1, 2, 3, 4, 5, 6...|
> | |   0|[1, 2, 3, 4, 5, 6...|
> |  |   3|[1, 2, 3, 4, 5, 6...|
> |   |   2|[1, 2, 3, 4, 5, 6...|
> ||   1|  [1, 2, 3, 4, 5, 6]|
> | |   0| [1, 2, 3, 4, 5]|
> |  |   3|[1, 2, 3, 4]|
> |   |   2|   [1, 2, 3]|
> ||   1|  [1, 2]|
> | |   0| [1]|
> +-+++
> {code}
> I was hoping to fix the issue as part of SPARK-18407 but it seems it's not 
> only applicable to StructuredStreaming and deserves its own JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18510) Partition schema inference corrupts data

2016-11-19 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz updated SPARK-18510:

Description: 
Not sure if this is a regression from 2.0 to 2.1. I was investigating this for 
Structured Streaming, but it seems it affects batch data as well.

Here's the issue:
If I specify my schema when doing
{code}
spark.read
  .schema(someSchemaWherePartitionColumnsAreStrings)
{code}

but if the partition inference can infer it as IntegerType or I assume LongType 
or DoubleType (basically fixed size types), then once UnsafeRows are generated, 
your data will be corrupted.

Reproduction:
{code}
val createArray = udf { (length: Long) =>
for (i <- 1 to length.toInt) yield i.toString
}
spark.range(10).select(createArray('id + 1) as 'ex, 'id, 'id % 4 as 
'part).coalesce(1).write
.partitionBy("part", "id")
.mode("overwrite")
.parquet(src.toString)
val schema = new StructType()
.add("id", StringType)
.add("part", IntegerType)
.add("ex", ArrayType(StringType))
spark.read
  .schema(schema)
  .format("parquet")
  .load(src.toString)
  .show()
{code}
The UDF is useful for creating a row long enough so that you don't hit other 
weird NullPointerExceptions caused for the same reason I believe.
Output:
{code}
+-+++
|   id|part|  ex|
+-+++
|�|   1|[1, 2, 3, 4, 5, 6...|
| |   0|[1, 2, 3, 4, 5, 6...|
|  |   3|[1, 2, 3, 4, 5, 6...|
|   |   2|[1, 2, 3, 4, 5, 6...|
||   1|  [1, 2, 3, 4, 5, 6]|
| |   0| [1, 2, 3, 4, 5]|
|  |   3|[1, 2, 3, 4]|
|   |   2|   [1, 2, 3]|
||   1|  [1, 2]|
| |   0| [1]|
+-+++
{code}

I was hoping to fix the issue as part of SPARK-18407 but it seems it's not only 
applicable to StructuredStreaming and deserves its own JIRA.

  was:
Not sure if this is a regression from 2.0 to 2.1. I was investigating this for 
Structured Streaming, but it seems it affects batch data as well.

Here's the issue:
If I specify my schema when doing
{code}
spark.read
  .schema(someSchemaWherePartitionColumnsAreStrings)
{code}

but if the partition inference can infer it as IntegerType or I assume LongType 
or DoubleType (basically fixed size types), then once UnsafeRows are generated, 
your data will be corrupted.

Reproduction:
{code}
val createArray = udf { (length: Long) =>
for (i <- 1 to length.toInt) yield i.toString
}
spark.range(10).select(createArray('id + 1) as 'ex, 'id, 'id % 4 as 
'part).coalesce(1).write
.partitionBy("part", "id")
.mode("overwrite")
.parquet(src.toString)
val schema = new StructType()
.add("id", StringType)
.add("part", IntegerType)
.add("ex", ArrayType(StringType))
spark.read
  .schema(schema)
  .format("parquet")
  .load(src.toString)
  .show()
{code}

Output:
{code}
+-+++
|   id|part|  ex|
+-+++
|�|   1|[1, 2, 3, 4, 5, 6...|
| |   0|[1, 2, 3, 4, 5, 6...|
|  |   3|[1, 2, 3, 4, 5, 6...|
|   |   2|[1, 2, 3, 4, 5, 6...|
||   1|  [1, 2, 3, 4, 5, 6]|
| |   0| [1, 2, 3, 4, 5]|
|  |   3|[1, 2, 3, 4]|
|   |   2|   [1, 2, 3]|
||   1|  [1, 2]|
| |   0| [1]|
+-+++
{code}

I was hoping to fix the issue as part of SPARK-18407 but it seems it's not only 
applicable to StructuredStreaming and deserves its own JIRA.


> Partition schema inference corrupts data
> 
>
> Key: SPARK-18510
> URL: https://issues.apache.org/jira/browse/SPARK-18510
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>Priority: Blocker
>
> Not sure if this is a regression from 2.0 to 2.1. I was investigating this 
> for Structured Streaming, but it seems it affects batch data as well.
> Here's the issue:
> If I specify my schema when doing
> {code}
> spark.read
>   .schema(someSchemaWherePartitionColumnsAreStrings)
> {code}
> but if the partition inference can infer it as IntegerType or I assume 
> LongType or DoubleType (basically fixed size types), then once UnsafeRows are 
> generated, your data will be corrupted.
> Reproduction:
> {code}
> val createArray = udf { (length: Long) =>
> for (i <- 1 to length.toInt) yield i.toString
> }
> spark.range(10).select(createArray(

[jira] [Created] (SPARK-18510) Partition schema inference corrupts data

2016-11-19 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-18510:
---

 Summary: Partition schema inference corrupts data
 Key: SPARK-18510
 URL: https://issues.apache.org/jira/browse/SPARK-18510
 Project: Spark
  Issue Type: Bug
  Components: SQL, Structured Streaming
Affects Versions: 2.1.0
Reporter: Burak Yavuz
Priority: Blocker


Not sure if this is a regression from 2.0 to 2.1. I was investigating this for 
Structured Streaming, but it seems it affects batch data as well.

Here's the issue:
If I specify my schema when doing
{code}
spark.read
  .schema(someSchemaWherePartitionColumnsAreStrings)
{code}

but if the partition inference can infer it as IntegerType or I assume LongType 
or DoubleType (basically fixed size types), then once UnsafeRows are generated, 
your data will be corrupted.

Reproduction:
{code}
val createArray = udf { (length: Long) =>
for (i <- 1 to length.toInt) yield i.toString
}
spark.range(10).select(createArray('id + 1) as 'ex, 'id, 'id % 4 as 
'part).coalesce(1).write
.partitionBy("part", "id")
.mode("overwrite")
.parquet(src.toString)
val schema = new StructType()
.add("id", StringType)
.add("part", IntegerType)
.add("ex", ArrayType(StringType))
spark.read
  .schema(schema)
  .format("parquet")
  .load(src.toString)
  .show()
{code}

Output:
{code}
+-+++
|   id|part|  ex|
+-+++
|�|   1|[1, 2, 3, 4, 5, 6...|
| |   0|[1, 2, 3, 4, 5, 6...|
|  |   3|[1, 2, 3, 4, 5, 6...|
|   |   2|[1, 2, 3, 4, 5, 6...|
||   1|  [1, 2, 3, 4, 5, 6]|
| |   0| [1, 2, 3, 4, 5]|
|  |   3|[1, 2, 3, 4]|
|   |   2|   [1, 2, 3]|
||   1|  [1, 2]|
| |   0| [1]|
+-+++
{code}

I was hoping to fix the issue as part of SPARK-18407 but it seems it's not only 
applicable to StructuredStreaming and deserves its own JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


