[Structured streaming, V2] commit on ContinuousReader

2018-05-03 Thread Jiří Syrový
Version: 2.3, DataSourceV2, ContinuousReader

Hi,

We're creating a new data source that fetches data from a streaming source
which requires committing received data. We would like to commit the data
periodically, once it has been retrieved and correctly processed, and then
fetch more.

One option could be to rely on Spark committing already-read data using
*commit(end: Offset)*, which is present in *ContinuousReader
(v2.reader.streaming)*, but it seems that this method is never called.

The question is whether this method, *commit(end: Offset)*, is ever used, and
if so, when? I went through part of the Spark code base, but haven't really
found any place where it could be called.
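
To make the intent concrete, below is a minimal, self-contained sketch of the
commit semantics we are after (all names such as AckableSource and
OffsetTracker are made up for the example; this is not the actual DataSourceV2
wiring): everything up to the committed offset gets acknowledged to the
upstream system and can then be dropped.

import scala.collection.mutable

// Hypothetical external source that requires acknowledging consumed data.
trait AckableSource {
  def ack(upToOffset: Long): Unit
}

// Tracks fetched records until Spark tells us they have been safely processed.
final class OffsetTracker(source: AckableSource) {
  private val pending = mutable.Map.empty[Long, String]   // offset -> record

  def record(offset: Long, value: String): Unit =
    pending += (offset -> value)

  // This is what we would like to call from commit(end: Offset),
  // after translating `end` back into our numeric offset.
  def commit(end: Long): Unit = {
    source.ack(end)
    val acknowledged = pending.keys.filter(_ <= end).toList
    pending --= acknowledged
  }
}

With periodic commits from Spark, this would let us batch acknowledgements
instead of acking every record individually.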

Thanks,
Jiri


Re: Dependency Injection and Microservice development with Spark

2017-01-04 Thread Jiří Syrový
Hi,

another nice approach is to use the Reader monad instead, together with a
framework that supports this style (e.g. Grafter -
https://github.com/zalando/grafter). It's lightweight and helps a bit with
dependency issues.
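
For illustration, here is a minimal hand-rolled sketch of the Reader-monad
style (this is not Grafter's actual API; AppConfig and the component names are
made up): each component is just a function of the configuration it needs, and
wiring is ordinary monadic composition.

final case class Reader[R, A](run: R => A) {
  def map[B](f: A => B): Reader[R, B] = Reader(r => f(run(r)))
  def flatMap[B](f: A => Reader[R, B]): Reader[R, B] = Reader(r => f(run(r)).run(r))
}

final case class AppConfig(inputPath: String, outputPath: String)

object ReaderExample {
  // Each "component" simply reads what it needs from the configuration.
  val inputPath: Reader[AppConfig, String]  = Reader(_.inputPath)
  val outputPath: Reader[AppConfig, String] = Reader(_.outputPath)

  // Wiring is a for-comprehension; no DI container involved.
  val describeJob: Reader[AppConfig, String] =
    for {
      in  <- inputPath
      out <- outputPath
    } yield s"read from $in, write to $out"

  def main(args: Array[String]): Unit =
    println(describeJob.run(AppConfig("/data/in", "/data/out")))
}

Grafter builds on the same idea with its own API, so the above is only meant
to show the general shape of the approach.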

2016-12-28 22:55 GMT+01:00 Lars Albertsson :

> Do you really need dependency injection?
>
> DI is often used for testing purposes. Data processing jobs are easy
> to test without DI, however, due to their functional and synchronous
> nature. Hence, DI is often unnecessary for testing data processing
> jobs, whether they are batch or streaming jobs.
>
> Or do you want to use DI for other reasons?
>
>
> Lars Albertsson
> Data engineering consultant
> www.mapflat.com
> https://twitter.com/lalleal
> +46 70 7687109
> Calendar: https://goo.gl/6FBtlS, https://freebusy.io/la...@mapflat.com
>
>
> On Fri, Dec 23, 2016 at 11:56 AM, Chetan Khatri
>  wrote:
> > Hello Community,
> >
> > The current approach I am using for Spark job development is Scala + SBT
> > and an uber jar, with a yml properties file to pass configuration
> > parameters. But if I would like to use dependency injection and
> > microservice development, like the Spring Boot features, in Scala, what
> > would be the standard approach?
> >
> > Thanks
> >
> > Chetan
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: NegativeArraySizeException / segfault

2016-05-30 Thread Jiří Syrový
I think I saw this one already, as the first indication that something was
wrong; it was related to
https://issues.apache.org/jira/browse/SPARK-13516
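
For context, the thread quoted below swaps a typed Aggregator for a plain
Dataset.reduce as a test workaround. Here is a minimal sketch of the two side
by side (just summing Longs; the names and setup are made up, against the 2.0
API used in the thread). It only shows what the two code paths look like; it
does not reproduce the exception.

import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

// A toy Aggregator that sums Longs.
object SumAgg extends Aggregator[Long, Long, Long] {
  def zero: Long = 0L
  def reduce(buffer: Long, value: Long): Long = buffer + value
  def merge(b1: Long, b2: Long): Long = b1 + b2
  def finish(buffer: Long): Long = buffer
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

val spark = SparkSession.builder().master("local[2]").appName("agg-vs-reduce").getOrCreate()
import spark.implicits._

val ds = spark.range(1, 101).as[Long]

val viaAggregator = ds.select(SumAgg.toColumn).head()  // goes through generated projections
val viaReduce     = ds.reduce(_ + _)                   // the plain reduce used as the workaround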

2016-05-28 1:34 GMT+02:00 Koert Kuipers :

> it seemed to be related to an Aggregator, so for tests we replaced it with
> an ordinary Dataset.reduce operation, and now we got:
>
> java.lang.NegativeArraySizeException
> at
> org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229)
> at
> org.apache.spark.unsafe.types.UTF8String.toString(UTF8String.java:821)
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply(Unknown
> Source)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
> i did get the generated code, but it's like 17 subtrees and it's not a test
> but a real company program so i cannot just send it over.
>
> i will try to create a small test program to reproduce it.
>
> On Fri, May 27, 2016 at 4:25 PM, Reynold Xin  wrote:
>
>> They should get printed if you turn on debug level logging.
>>
>> On Fri, May 27, 2016 at 1:00 PM, Koert Kuipers  wrote:
>>
>>> hello all,
>>> after getting our unit tests to pass on spark 2.0.0-SNAPSHOT we are now
>>> trying to run some algorithms at scale on our cluster.
>>> unfortunately this means that when i see errors i am having a harder
>>> time boiling it down to a small reproducible example.
>>>
>>> today we are running an iterative algo using the dataset api and we are
>>> seeing tasks fail with errors which seem to be related to unsafe operations.
>>> the same tasks succeed without issues in our unit tests.
>>>
>>> i see either:
>>>
>>> 16/05/27 12:54:46 ERROR executor.Executor: Exception in task 31.0 in
>>> stage 21.0 (TID 1073)
>>> java.lang.NegativeArraySizeException
>>> at
>>> org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229)
>>> at
>>> org.apache.spark.unsafe.types.UTF8String.toString(UTF8String.java:821)
>>> at
>>> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply(Unknown
>>> Source)
>>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>>> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>>> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>>> at
>>> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>>> Source)
>>> at
>>> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>>> Source)
>>> at
>>> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>>> at
>>> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$7$$anon$1.hasNext(WholeStageCodegenExec.scala:359)
>>> at
>>> org.apache.spark.sql.execution.aggregate.SortBasedAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortBasedAggregateExec.scala:74)
>>> at
>>> org.apache.spark.sql.execution.aggregate.SortBasedAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortBasedAggregateExec.scala:71)
>>> at
>>> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:775)
>>> at
>>> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:775)
>>> at
>>> 

Fwd: Aggregation + Adding static column + Union + Projection = Problem

2016-02-26 Thread Jiří Syrový
Hi,

I've recently noticed a bug in Spark (branch 1.6) that appears if you do
the following:

Let's have some DataFrame called df.

1) Aggregate multiple columns on the DataFrame df and store the result as
result_agg_1
2) Do another aggregation of multiple columns, but with one fewer grouping
column, and store the result as result_agg_2
3) Align the result of the second aggregation by adding the missing grouping
column with an empty value, lit("")
4) Union result_agg_1 and result_agg_2
5) Do the projection from "sum(count_column)" to "count_column" for all
aggregated columns.

The result is a structurally inconsistent DataFrame in which all the data
coming from result_agg_1 is shifted.

An example of the stripped-down code and an example result can be seen here;
a minimal sketch of the same sequence is also included below:

https://gist.github.com/xjrk58/e0c7171287ee9bdc8df8
https://gist.github.com/xjrk58/7a297a42ebb94f300d96
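
For reference, here is a minimal sketch of the sequence described above, with
made-up column names (the actual reproduction and output are in the gists):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{lit, sum}

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("union-shift"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val df = Seq(("a", "x", 1L), ("a", "y", 2L), ("b", "x", 3L))
  .toDF("group1", "group2", "count_column")

// 1) aggregate on both grouping columns
val result_agg_1 = df.groupBy("group1", "group2").agg(sum("count_column"))

// 2) + 3) aggregate with one fewer grouping column, then add the missing one as lit("")
val result_agg_2 = df.groupBy("group1").agg(sum("count_column"))
  .withColumn("group2", lit(""))

// 4) union the two results (unionAll on branch 1.6)
val unioned = result_agg_1.unionAll(result_agg_2)

// 5) project "sum(count_column)" back to "count_column"
val result = unioned.withColumnRenamed("sum(count_column)", "count_column")

result.show()

When reading the output, keep in mind that unionAll matches columns by
position rather than by name, so the position at which withColumn appends the
missing grouping column matters for the union.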

Best,
Jiri Syrovy


Re: [VOTE] Release Apache Spark 1.6.0 (RC3)

2015-12-16 Thread Jiří Syrový
+1 Tested in standalone mode and so far seems to be fairly stable.

2015-12-16 22:32 GMT+01:00 Michael Armbrust :

> Please vote on releasing the following candidate as Apache Spark version
> 1.6.0!
>
> The vote is open until Saturday, December 19, 2015 at 18:00 UTC and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is *v1.6.0-rc3
> (168c89e07c51fa24b0bb88582c739cec0acb44d7)
> *
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1174/
>
> The test repository (versioned as v1.6.0-rc3) for this release can be
> found at:
> https://repository.apache.org/content/repositories/orgapachespark-1173/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
>
> ===
> == How can I help test this release? ==
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> 
> == What justifies a -1 vote for this release? ==
> 
> This vote is happening towards the end of the 1.6 QA period, so -1 votes
> should only occur for significant regressions from 1.5. Bugs already
> present in 1.5, minor regressions, or bugs related to new features will not
> block this release.
>
> ===
> == What should happen to JIRA tickets still targeting 1.6.0? ==
> ===
> 1. It is OK for documentation patches to target 1.6.0 and still go into
> branch-1.6, since documentations will be published separately from the
> release.
> 2. New features for non-alpha-modules should target 1.7+.
> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
> version.
>
>
> ==
> == Major changes to help you focus your testing ==
> ==
>
> Notable changes since 1.6 RC2
> - SPARK_VERSION has been set correctly
> - SPARK-12199 ML Docs are publishing correctly
> - SPARK-12345 Mesos cluster mode has been fixed
>
> Notable changes since 1.6 RC1
> Spark Streaming
>
>- SPARK-2629  
>trackStateByKey has been renamed to mapWithState
>
> Spark SQL
>
>- SPARK-12165 
>SPARK-12189  Fix
>bugs in eviction of storage memory by execution.
>- SPARK-12258  correct
>passing null into ScalaUDF
>
> Notable Features Since 1.5
>
> Spark SQL
>
>- SPARK-11787  Parquet
>Performance - Improve Parquet scan performance when using flat schemas.
>- SPARK-10810 
>    Session Management - Isolated default database (i.e. USE mydb) even on
>shared clusters.
>- SPARK-   Dataset
>API - A type-safe API (similar to RDDs) that performs many operations
>on serialized binary data and code generation (i.e. Project Tungsten).
>- SPARK-1  Unified
>Memory Management - Shared memory for execution and caching instead of
>exclusive division of the regions.
>- SPARK-11197  SQL
>Queries on Files - Concise syntax for running SQL queries over files
>of any supported format without registering a table.
>- SPARK-11745  Reading
>non-standard JSON files - Added options to read non-standard JSON
>files (e.g. single-quotes, unquoted attributes)
>- SPARK-10412  
> Per-operator
>    Metrics for SQL Execution - Display statistics on a per-operator basis
>for memory usage and spilled data size.
>- SPARK-11329  Star
>    (*) expansion for StructTypes - Makes it easier to nest and unnest
>arbitrary numbers of