[CFP] DataWorks Summit Europe 2018 - Call for abstracts

2017-12-09 Thread Yanbo Liang
The DataWorks Summit Europe is in Berlin, Germany this year, on April 16-19, 
2018. This is a great place to talk about work you are doing in Apache Spark or 
how you are using Spark for SQL/streaming processing, machine learning and data 
science. Information on submitting an abstract is at
https://dataworkssummit.com/berlin-2018/

Tracks:
Data Warehousing and Operational Data Stores
Artificial Intelligence and Data Science
Big Compute and Storage
Cloud and Operations
Governance and Security
Cyber Security
IoT and Streaming
Enterprise Adoption

Deadline: December 15th, 2017

Re: queryable state & streaming

2017-12-09 Thread Stavros Kontopoulos
Nice, I was looking for a JIRA. I agree we should justify why we are
building something. In that direction, here is what I have seen from my
experience.
People quite often use state within their streaming apps and may have large
state (TBs). Shortening the pipeline by not having to copy data (to
Cassandra, for example, for serving) is an advantage, at least in terms of
latency and complexity.
This holds if we take advantage of state checkpointing (locally this could be
RocksDB, or in general HDFS; the latter is currently supported) along with
an API to efficiently query data.
Some use cases I see:

- real-time dashboards and real-time reporting, the faster the better
- monitoring of state for operational reasons, app health, etc.
- integrating with external services via an API, e.g. making
 aggregations over time windows accessible to some third-party service
 within your system

Regarding requirements, here are some of them:
- support for an API to expose state (could be done at the Spark driver),
such as REST
- support for dynamic allocation (not sure how it affects state management)
- an efficient way to talk to executors to get the state (RPC?)
- making local state more efficient and more easily accessible with an
embedded DB (I don't think this is supported from what I see, but I may be
wrong)
Some people are already working with such technologies, and some of that
work could be re-used: https://issues.apache.org/jira/browse/SPARK-20641
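
To make the requirements above a bit more concrete, here is a minimal
sketch in plain Scala. It is purely hypothetical: PartitionState and
QueryableState are names made up for illustration and are not part of
Spark or of the StateStore API. The in-memory maps stand in for
executor-local stores (an embedded DB in a real design), and the router
stands in for whatever driver-side API and RPC layer would forward a lookup
to the executor that owns the key.

```scala
// Hypothetical sketch only: none of these types exist in Spark.

/** Committed state of one partition (stand-in for an executor-local store). */
final class PartitionState[K, V] {
  // Version and data are swapped together, so a reader never sees a
  // half-applied batch.
  @volatile private var snapshot: (Long, Map[K, V]) = (0L, Map.empty[K, V])

  /** Would be called by the streaming job when a micro-batch commits. */
  def commit(updates: Map[K, V], newVersion: Long): Unit = synchronized {
    snapshot = (newVersion, snapshot._2 ++ updates)
  }

  def get(key: K): Option[V] = snapshot._2.get(key)

  def version: Long = snapshot._1
}

/** Driver-side router: finds the partition that owns a key and queries it. */
final class QueryableState[K, V](numPartitions: Int) {
  require(numPartitions > 0, "numPartitions must be positive")

  private val partitions = Vector.fill(numPartitions)(new PartitionState[K, V])

  // Non-negative hash partitioning, so lookups hit the partition that was written.
  private def partitionFor(key: K): Int = {
    val h = key.hashCode % numPartitions
    if (h < 0) h + numPartitions else h
  }

  /** Applies one batch of updates; every partition moves to the new version. */
  def update(batch: Map[K, V], version: Long): Unit = {
    val byPartition = batch.groupBy { case (k, _) => partitionFor(k) }
    partitions.zipWithIndex.foreach { case (p, i) =>
      p.commit(byPartition.getOrElse(i, Map.empty[K, V]), version)
    }
  }

  /** Point lookup, as a REST handler on the driver might issue it. */
  def query(key: K): Option[V] = partitions(partitionFor(key)).get(key)

  /** Oldest version any partition has committed. */
  def version: Long = partitions.map(_.version).min
}
```

With this sketch, a dashboard would call query("some-key") against the
router and get back the last committed value for that key, without the
data being copied to an external serving store first. Dynamic allocation is
deliberately left out: partitions here never move between executors.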

Best,
Stavros


On Fri, Dec 8, 2017 at 10:32 PM, Michael Armbrust 
wrote:

> https://issues.apache.org/jira/browse/SPARK-16738
>
> I don't believe anyone is working on it yet.  I think the most useful
> thing is to start enumerating requirements and use cases and then we can
> talk about how to build it.
>
> On Fri, Dec 8, 2017 at 10:47 AM, Stavros Kontopoulos <
> st.kontopou...@gmail.com> wrote:
>
>> Cool. Burak, do you have a pointer? Should I take the initiative on a
>> first design document, or is Databricks working on it?
>>
>> Best,
>> Stavros
>>
>> On Fri, Dec 8, 2017 at 8:40 PM, Burak Yavuz  wrote:
>>
>>> Hi Stavros,
>>>
>>> Queryable state is definitely on the roadmap! We will revamp the
>>> StateStore API a bit, and a queryable StateStore is definitely one of the
>>> things we are thinking about during that revamp.
>>>
>>> Best,
>>> Burak
>>>
>>> On Dec 8, 2017 9:57 AM, "Stavros Kontopoulos" 
>>> wrote:
>>>
>>>> Just to re-phrase my question: Would query-able state make a viable
>>>> SPIP?
>>>>
>>>> Regards,
>>>> Stavros
>>>>
>>>> On Thu, Dec 7, 2017 at 1:34 PM, Stavros Kontopoulos <
>>>> st.kontopou...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Maybe this has been discussed before. Given that many streaming
>>>>> apps out there use state extensively, it could be a good idea to
>>>>> make Spark expose streaming state through an external API, like other
>>>>> systems do (Kafka Streams, Flink, etc.), in order to facilitate
>>>>> interactive queries?
>>>>>
>>>>> Regards,
>>>>> Stavros
>>>>>


>>
>


Re: RDD[internalRow] -> DataSet

2017-12-09 Thread Jacek Laskowski
Hi Satyajit,

That's exactly what Dataset.rdd does -->
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala?utf8=%E2%9C%93#L2916-L2921
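
For code that lives outside the sql package, one possible route is
sketched below. It assumes Spark 2.x on the classpath and a known schema.
InternalRowConversion and toDataFrame are hypothetical names used only for
illustration. CatalystTypeConverters is a Catalyst-internal class, so it
may change between releases. The idea is to turn each InternalRow into an
external Row and then go through the public createDataFrame.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.catalyst.{CatalystTypeConverters, InternalRow}
import org.apache.spark.sql.types.StructType

object InternalRowConversion {

  // Hypothetical helper, not part of Spark: converts an RDD[InternalRow]
  // with a known schema into a DataFrame using only public entry points
  // plus the Catalyst type converters.
  def toDataFrame(
      spark: SparkSession,
      rows: RDD[InternalRow],
      schema: StructType): DataFrame = {
    val externalRows: RDD[Row] = rows.mapPartitions { iter =>
      // Build the converter once per partition rather than once per row.
      val toScala = CatalystTypeConverters.createToScalaConverter(schema)
      // Convert each row as it is read, since internal rows may be reused.
      iter.map(internal => toScala(internal).asInstanceOf[Row])
    }
    spark.createDataFrame(externalRows, schema)
  }
}
```

A typed Dataset can then be obtained with .as[T], given an implicit Encoder
for T in scope.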

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

On Fri, Dec 8, 2017 at 5:25 AM, satyajit vegesna  wrote:

> Hi All,
>
> Is there a way to convert an RDD[InternalRow] to a Dataset from outside
> the Spark SQL package?
>
> Regards,
> Satyajit.
>


Re: BUILD FAILURE due to...not found: value AnalysisBarrier in spark-catalyst_2.11?

2017-12-09 Thread Jacek Laskowski
Hi,

Thanks Sean! You're right -- my local repo got hosed. I don't know why the
patch with AnalysisBarrier didn't go through.

Speaking of the patch [1] I've noticed a sentence in the scaladoc of
AnalysisBarrier that does not make much sense to me. There's something
missing in it, isn't there?

> The SQL Analyzer goes through a whole query plan even most part of it is
> analyzed.

What is this sentence telling me? Is an `if` missing between "even" and "most"?

[1]
https://github.com/apache/spark/blob/00d176d2fe7bbdf55cb3146a9cb04ca99b1858b7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala?utf8=%E2%9C%93#L890

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

On Fri, Dec 8, 2017 at 6:40 PM, Sean Owen  wrote:

> Build is fine for me, and on Jenkins. Try a clean build?
>
> On Fri, Dec 8, 2017 at 11:04 AM Jacek Laskowski  wrote:
>
>> Hi,
>>
>> Just got BUILD FAILURE and have been wondering if it's just me or is this
>> a known issue that's being worked on?
>>
>> (Sorry if that's just my local setup that I got broken)
>>
>> [INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @
>> spark-catalyst_2.11 ---
>> [INFO] Using zinc server for incremental compilation
>> [warn] Pruning sources from previous analysis, due to incompatible
>> CompileSetup.
>> [info] Compiling 222 Scala sources and 27 Java sources to
>> /Users/jacek/dev/oss/spark/sql/catalyst/target/scala-2.11/classes...
>> [error] /Users/jacek/dev/oss/spark/sql/catalyst/src/main/scala/
>> org/apache/spark/sql/catalyst/analysis/Analyzer.scala:728: not found:
>> value AnalysisBarrier
>> [error]   AnalysisBarrier(newRight)
>> [error]   ^
>> [error] /Users/jacek/dev/oss/spark/sql/catalyst/src/main/scala/
>> org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1081: not found:
>> value AnalysisBarrier
>> [error]   case sa @ Sort(_, _, AnalysisBarrier(child: Aggregate)) =>
>> sa
>> [error]^
>> [error] /Users/jacek/dev/oss/spark/sql/catalyst/src/main/scala/
>> org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1134: not found:
>> value AnalysisBarrier
>> [error] return AnalysisBarrier(plan)
>> [error]^
>> [error] /Users/jacek/dev/oss/spark/sql/catalyst/src/main/scala/
>> org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1409: not found:
>> value AnalysisBarrier
>> [error]   case filter @ Filter(havingCondition,
>> AnalysisBarrier(aggregate: Aggregate)) =>
>> [error] ^
>> [error] /Users/jacek/dev/oss/spark/sql/catalyst/src/main/scala/
>> org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1410: not found:
>> value AnalysisBarrier
>> [error] apply(Filter(havingCondition, aggregate)).mapChildren(
>> AnalysisBarrier)
>> [error]   ^
>> [error] /Users/jacek/dev/oss/spark/sql/catalyst/src/main/scala/
>> org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1470: not found:
>> value AnalysisBarrier
>> [error]   case sort @ Sort(sortOrder, global,
>> AnalysisBarrier(aggregate: Aggregate)) =>
>> [error]   ^
>> [error] /Users/jacek/dev/oss/spark/sql/catalyst/src/main/scala/
>> org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1471: not found:
>> value AnalysisBarrier
>> [error] apply(Sort(sortOrder, global, aggregate)).mapChildren(
>> AnalysisBarrier)
>> [error]   ^
>> [error] /Users/jacek/dev/oss/spark/sql/catalyst/src/main/scala/
>> org/apache/spark/sql/catalyst/analysis/Analyzer.scala:2345: not found:
>> value AnalysisBarrier
>> [error] case AnalysisBarrier(child) => child
>> [error]  ^
>> [error] 8 errors found
>> [error] Compile failed at Dec 8, 2017 5:58:10 PM [8.170s]
>> [INFO] 
>> 
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://about.me/JacekLaskowski
>> Spark Structured Streaming https://bit.ly/spark-structured-streaming
>> Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>