Re: Apache Tinkerpop & Geode Integration?

2015-11-25 Thread Vasiliki Kalavri
Hi James,

I've just subscribed to the TinkerPop dev mailing list. Could you please
send a reply to the thread, so that I can reply to it?
I'm not sure how I could reply to the thread otherwise...
I also saw that there is a grafos.ml project thread. I could also provide
some input there :)

Thanks!
-Vasia.

On 25 November 2015 at 15:09, James Thornton 
wrote:

> Hi Vasia -
>
> Yes, a FlinkGraphComputer should be a straight-forward first step. Also, on
> the Apache Tinkerpop dev mailing list, Marko thought it might be cool if
> there was a "Graph API" similar to the "Table API" -- hooking in Gremlin to
> Flink's fluent API would give Flink users a full graph query language.
>
> Stephen Mallette is a TinkerPop core contributor, and he has already
> started working on a FlinkGraphComputer. There is a Flink/Tinkerpop thread
> on the TinkerPop dev list -- it would be great to have you part of the
> conversation there too as we work on the integration:
>
>http://mail-archives.apache.org/mod_mbox/incubator-tinkerpop-dev/
>
> Thanks, Vasia.
>
> - James
>
>
> On Mon, Nov 23, 2015 at 10:28 AM, Vasiliki Kalavri <
> vasilikikala...@gmail.com> wrote:
>
> > Hi James,
> >
> > thank you for your e-mail and your interest in Flink :)
> >
> > I've recently taken a _quick_ look into Apache TinkerPop and I think it'd
> > be very interesting to integrate with Flink/Gelly.
> > Are you thinking about something like a Flink GraphComputer, similar to
> > the Giraph and Spark GraphComputers?
> > I believe such an integration should be straight-forward to implement.
> You
> > can start by looking into Flink iteration operators [1] and Gelly
> iteration
> > abstractions [2].
> >
> > Regarding Apache Geode, I'm not familiar with the project, but I'll try to
> > take a look in the following days!
> >
> > Cheers,
> > -Vasia.
> >
> >
> > [1]:
> >
> >
> https://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#iteration-operators
> > [2]:
> >
> >
> https://ci.apache.org/projects/flink/flink-docs-master/libs/gelly_guide.html#iterative-graph-processing
> >
> >
> > On 20 November 2015 at 08:32, James Thornton 
> > wrote:
> >
> > > Hi -
> > >
> > > This is James Thornton (espeed) from the Apache Tinkerpop project (
> > > http://tinkerpop.incubator.apache.org/).
> > >
> > > The Flink iterators should pair well with Gremlin's Graph Traversal
> > Machine
> > > (
> > >
> > >
> >
> http://www.datastax.com/dev/blog/the-benefits-of-the-gremlin-graph-traversal-machine
> > > )
> > > -- it would be good to coordinate on creating an integration.
> > >
> > > Also, Apache Geode made a splash today on HN (
> > > https://news.ycombinator.com/item?id=10596859) -- connecting Geode and
> > > Flink would be killer. Here's the Geode/Spark connector for reference:
> > >
> > >
> > >
> > >
> >
> https://github.com/apache/incubator-geode/tree/develop/gemfire-spark-connector
> > >
> > > - James
> > >
> >
>
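
As a side note on the integration discussed above: a minimal sketch of the Flink
bulk-iteration operators Vasia references in [1], assuming the plain Java DataSet API
(the class name and the pi-estimation body, modelled on the iterations documentation,
are only illustrative):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.IterativeDataSet;

public class BulkIterationSketch {
    public static void main(String[] args) throws Exception {
        final int numIterations = 10000;
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Start a bulk iteration over a single counter element.
        IterativeDataSet<Integer> initial = env.fromElements(0).iterate(numIterations);

        // One "dart throw" per superstep, as in the pi-estimation example.
        DataSet<Integer> step = initial.map(new MapFunction<Integer, Integer>() {
            @Override
            public Integer map(Integer hits) {
                double x = Math.random();
                double y = Math.random();
                return hits + ((x * x + y * y < 1) ? 1 : 0);
            }
        });

        // Close the loop: the result of each superstep is fed back as the next input.
        DataSet<Integer> hits = initial.closeWith(step);

        hits.map(new MapFunction<Integer, Double>() {
            @Override
            public Double map(Integer h) {
                return 4.0 * h / numIterations;
            }
        }).print();
    }
}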


Re: [VOTE] Release Apache Flink 0.10.1 (release-0.10.0-rc1)

2015-11-25 Thread Aljoscha Krettek
+1

I ran an example with a custom operator that processes high-volume kafka 
input/output and has a large state size. I ran this on 10 GCE nodes.

> On 25 Nov 2015, at 14:58, Till Rohrmann  wrote:
> 
> Alright, then I withdraw my remark concerning testdata.avro.
> 
> On Wed, Nov 25, 2015 at 2:56 PM, Stephan Ewen  wrote:
> 
>> @Till I think the avro test data file is okay, the "no binaries" policy
>> refers to binary executables, as far as I know.
>> 
>> On Wed, Nov 25, 2015 at 2:54 PM, Till Rohrmann 
>> wrote:
>> 
>>> Checked checksums for src release and Hadoop 2.7 Scala 2.10 release
>>> 
>>> Checked binaries in source release
>>> - contains ./flink-staging/flink-avro/src/test/resources/testdata.avro
>>> 
>>> License
>>> - no new files added which are relevant for licensing
>>> 
>>> Build Flink and run tests from source release for Hadoop 2.5.1
>>> 
>>> Checked that log files don't contain exceptions and that out files are
>>> empty
>>> 
>>> Run all examples with Hadoop 2.7 Scala 2.10 binaries via FliRTT tool on 4
>>> node standalone cluster and YARN cluster
>>> 
>>> Tested planVisualizer
>>> 
>>> Tested flink command line client
>>> - tested info command
>>> - tested -p option
>>> 
>>> Tested cluster HA in standalone mode => working
>>> 
>>> Tested cluster HA on YARN (2.7.1) => working
>>> 
>>> Except for the avro testdata file which is contained in the source
>> release,
>>> I didn't find anything.
>>> 
>>> +1 for releasing and removing the testdata file for the next release.
>>> 
>>> On Wed, Nov 25, 2015 at 2:33 PM, Robert Metzger 
>>> wrote:
>>> 
 +1
 
 - Build a maven project with the staging repository
 - started Flink on YARN on a CDH 5.4.5 / Hadoop 2.6.0-cdh5.4.5 cluster
>>> with
 YARN and HDFS HA
 - ran some kafka (0.8.2.0) read / write experiments
 - job cancellation with yarn is working ;)
 
 I found the following issue while testing:
 https://issues.apache.org/jira/browse/FLINK-3078 but it was already in
 0.10.0 and its not super critical bc the JobManager container will be
 killed by YARN after a few minutes.
 
 
 I'll extend the vote until tomorrow Thursday, November 26.
 
 
 On Tue, Nov 24, 2015 at 1:54 PM, Stephan Ewen 
>> wrote:
 
> @Gyula: I think it affects users, so should definitely be fixed very
>>> soon
> (either 0.10.1 or 0.10.2)
> 
> Still checking whether Robert's current version fix solves it now, or
> not...
> 
> On Tue, Nov 24, 2015 at 1:46 PM, Vyacheslav Zholudev <
> vyacheslav.zholu...@gmail.com> wrote:
> 
>> I can confirm that the build works fine when increasing a max
>> number
>>> of
>> opened files. Sorry for confusion.
>> 
>> 
>> 
>> --
>> View this message in context:
>> 
> 
 
>>> 
>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-Apache-Flink-0-10-1-release-0-10-0-rc1-tp9296p9327.html
>> Sent from the Apache Flink Mailing List archive. mailing list
>> archive
 at
>> Nabble.com.
>> 
> 
 
>>> 
>> 



Re: [VOTE] Release Apache Flink 0.10.1 (release-0.10.0-rc1)

2015-11-25 Thread Ufuk Celebi
+1

- Verified hashes and signatures
- Ran example jobs on YARN with vanilla Hadoop versions (on 4 GCE nodes):
  * 2.7.1 with Flink Hadoop 2.7 binary, Scala 2.10 and 11
  * 2.6.2 with Flink Hadoop 2.6 binary, Scala 2.10
  * 2.4.1 with Flink Hadoop 2.4 binary, Scala 2.10
  * 2.3.0 with Flink Hadoop 2 binary, Scala 2.10
- Cancelled a restarting job via CLI and web interface
- Ran simple Kafka read-write pipeline
- Ran manual tests
- Ran examples on local cluster

> On 25 Nov 2015, at 17:45, Aljoscha Krettek  wrote:
> 
> +1
> 
> I ran an example with a custom operator that processes high-volume kafka 
> input/output and has a large state size. I ran this on 10 GCE nodes.
> 
>> On 25 Nov 2015, at 14:58, Till Rohrmann  wrote:
>> 
>> Alright, then I withdraw my remark concerning testdata.avro.
>> 
>> On Wed, Nov 25, 2015 at 2:56 PM, Stephan Ewen  wrote:
>> 
>>> @Till I think the avro test data file is okay, the "no binaries" policy
>>> refers to binary executables, as far as I know.
>>> 
>>> On Wed, Nov 25, 2015 at 2:54 PM, Till Rohrmann 
>>> wrote:
>>> 
 Checked checksums for src release and Hadoop 2.7 Scala 2.10 release
 
 Checked binaries in source release
 - contains ./flink-staging/flink-avro/src/test/resources/testdata.avro
 
 License
 - no new files added which are relevant for licensing
 
 Build Flink and run tests from source release for Hadoop 2.5.1
 
 Checked that log files don't contain exceptions and that out files are
 empty
 
 Run all examples with Hadoop 2.7 Scala 2.10 binaries via FliRTT tool on 4
 node standalone cluster and YARN cluster
 
 Tested planVisualizer
 
 Tested flink command line client
 - tested info command
 - tested -p option
 
 Tested cluster HA in standalone mode => working
 
 Tested cluster HA on YARN (2.7.1) => working
 
 Except for the avro testdata file which is contained in the source
>>> release,
 I didn't find anything.
 
 +1 for releasing and removing the testdata file for the next release.
 
 On Wed, Nov 25, 2015 at 2:33 PM, Robert Metzger 
 wrote:
 
> +1
> 
> - Build a maven project with the staging repository
> - started Flink on YARN on a CDH 5.4.5 / Hadoop 2.6.0-cdh5.4.5 cluster
 with
> YARN and HDFS HA
> - ran some kafka (0.8.2.0) read / write experiments
> - job cancellation with yarn is working ;)
> 
> I found the following issue while testing:
> https://issues.apache.org/jira/browse/FLINK-3078 but it was already in
> 0.10.0 and its not super critical bc the JobManager container will be
> killed by YARN after a few minutes.
> 
> 
> I'll extend the vote until tomorrow Thursday, November 26.
> 
> 
> On Tue, Nov 24, 2015 at 1:54 PM, Stephan Ewen 
>>> wrote:
> 
>> @Gyula: I think it affects users, so should definitely be fixed very
 soon
>> (either 0.10.1 or 0.10.2)
>> 
>> Still checking whether Robert's current version fix solves it now, or
>> not...
>> 
>> On Tue, Nov 24, 2015 at 1:46 PM, Vyacheslav Zholudev <
>> vyacheslav.zholu...@gmail.com> wrote:
>> 
>>> I can confirm that the build works fine when increasing a max
>>> number
 of
>>> opened files. Sorry for confusion.
>>> 
>>> 
>>> 
>>> --
>>> View this message in context:
>>> 
>> 
> 
 
>>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-Apache-Flink-0-10-1-release-0-10-0-rc1-tp9296p9327.html
>>> Sent from the Apache Flink Mailing List archive. mailing list
>>> archive
> at
>>> Nabble.com.
>>> 
>> 
> 
 
>>> 
> 



Re: [VOTE] Release Apache Flink 0.10.1 (release-0.10.0-rc1)

2015-11-25 Thread Henry Saputra
+1

LICENSE file looks good in source artifact
NOTICE file looks good in source artifact
Signature file looks good in source artifact
Hash files look good in source artifact
No 3rd party executables in source artifact
Source compiled
All tests passed
Run standalone mode test app

- Henry

On Mon, Nov 23, 2015 at 4:45 AM, Robert Metzger  wrote:
> Hi All,
>
> this is the first bugfix release for the 0.10 series of Flink.
> I've CC'ed the user@ list in case users are interested in helping to verify the
> release.
>
> It contains fixes for critical issues, in particular:
> - FLINK-3021 Fix class loading issue for streaming sources
> - FLINK-2974 Add periodic offset committer for Kafka
> - FLINK-2977 Using reflection to load HBase Kerberos tokens
> - FLINK-3024 Fix TimestampExtractor.getCurrentWatermark() Behaviour
> - FLINK-2967 Increase timeout for LOCAL_HOST address detection strategy
> - FLINK-3025 [kafka consumer] Bump transitive ZkClient dependency
> - FLINK-2989 job cancel button doesn't work on YARN
> - FLINK-3032: Flink does not start on Hadoop 2.7.1 (HDP), due to class
> conflict
> - FLINK-3011, 3019, 3028 Cancel jobs in RESTARTING state
>
> This is the guide on how to verify a release:
> https://cwiki.apache.org/confluence/display/FLINK/Releasing
>
> During the testing, please focus on trying out Flink on different Hadoop
> platforms: We changed how Hadoop's Maven dependencies are packaged,
> so maybe there are issues with different Hadoop distributions.
> The Kafka consumer also changed a bit, so it would be good to test it on a
> cluster.
>
> -
>
> Please vote on releasing the following candidate as Apache Flink version
> 0.10.1:
>
> The commit to be voted on:
> http://git-wip-us.apache.org/repos/asf/flink/commit/2e9b2316
>
> Branch:
> release-0.10.1-rc1 (see
> https://git1-us-west.apache.org/repos/asf/flink/?p=flink.git)
>
> The release artifacts to be voted on can be found at:
> http://people.apache.org/~rmetzger/flink-0.10.1-rc1/
>
> The release artifacts are signed with the key with fingerprint  D9839159:
> http://www.apache.org/dist/flink/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapacheflink-1058
>
> -
>
> The vote is open for the next 72 hours and passes if a majority of at least
> three +1 PMC votes are cast.
>
> The vote ends on Wednesday, November 25.
>
> [ ] +1 Release this package as Apache Flink 0.10.1
> [ ] -1 Do not release this package because ...
>
> ===


[jira] [Created] (FLINK-3079) Add utility for measuring the raw read throughput from a Kafka topic

2015-11-25 Thread Robert Metzger (JIRA)
Robert Metzger created FLINK-3079:
-

 Summary: Add utility for measuring the raw read throughput from a 
Kafka topic
 Key: FLINK-3079
 URL: https://issues.apache.org/jira/browse/FLINK-3079
 Project: Flink
  Issue Type: New Feature
  Components: Kafka Connector
Reporter: Robert Metzger
Assignee: Robert Metzger
Priority: Minor


I would like to add a little example job to the kafka connector package which 
measures the raw read throughput from a Kafka topic.

This is helpful when trying to debug performance when reading from Kafka. 
Such a tool would help find the bottleneck in a data processing pipeline.
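
A rough sketch of what such a job could look like (assuming the 0.10-era 
FlinkKafkaConsumer082 and SimpleStringSchema; the topic name, properties and the 
rate-logging sink are only illustrative):

{code}
import java.util.Properties;

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer082;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class ReadThroughputJob {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("zookeeper.connect", "localhost:2181");
        props.setProperty("group.id", "throughput-test");

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.addSource(new FlinkKafkaConsumer082<String>("test-topic", new SimpleStringSchema(), props))
            // Discarding sink that only counts records and logs the rate roughly once per second.
            .addSink(new RichSinkFunction<String>() {
                private long count = 0;
                private long lastLog = System.currentTimeMillis();

                @Override
                public void invoke(String value) {
                    count++;
                    long now = System.currentTimeMillis();
                    if (now - lastLog >= 1000) {
                        System.out.println("Read " + count + " records in " + (now - lastLog) + " ms");
                        count = 0;
                        lastLog = now;
                    }
                }
            });

        env.execute("Kafka read throughput");
    }
}
{code}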





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-3080) Cannot union a data stream with a product of itself

2015-11-25 Thread Vasia Kalavri (JIRA)
Vasia Kalavri created FLINK-3080:


 Summary: Cannot union a data stream with a product of itself
 Key: FLINK-3080
 URL: https://issues.apache.org/jira/browse/FLINK-3080
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.10.0
Reporter: Vasia Kalavri


Currently it is not possible to union a stream with itself or a product of 
itself.
For example, the following code
{{stream.union(stream.map(...))}}
fails with this exception: "A DataStream cannot be unioned with itself".
The documentation, which currently states that "If you union a data stream with 
itself you will still only get each element once.", should also be updated.
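
A minimal reproducer, for reference (the generateSequence source and the map body are 
only illustrative):

{code}
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UnionRepro {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Long> stream = env.generateSequence(0, 100);

        // Fails on 0.10.0 with "A DataStream cannot be unioned with itself",
        // although the right-hand side is a derived stream, not the stream itself.
        stream.union(stream.map(new MapFunction<Long, Long>() {
            @Override
            public Long map(Long value) {
                return value + 1;
            }
        })).print();

        env.execute("union repro");
    }
}
{code}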



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Union a data stream with a product of itself

2015-11-25 Thread Vasiliki Kalavri
Here's the issue: https://issues.apache.org/jira/browse/FLINK-3080

-V.

On 25 November 2015 at 14:38, Gyula Fóra  wrote:

> Yes, please
>
> Vasiliki Kalavri  wrote (on Wed, 25 Nov 2015, 14:37):
>
> > So, do we all agree that the current behavior is not correct? Shall I
> open
> > a JIRA about this?
> >
> > On 25 November 2015 at 13:58, Gyula Fóra  wrote:
> >
> > > Well it kind of depends on what definition of union are we using. If
> this
> > > is a union in a set theoretical way we can argue that the union of a
> > stream
> > > with itself should be the same stream because it contains exactly the
> > same
> > > elements with the same timestamps and lineage.
> > >
> > > On the other hand stream and stream.map(id) are not exactly the same as
> > > they might have elements with different order (the lineage differs).
> > >
> > > So I wouldnt say that any self-union semantics is the only possible
> one.
> > >
> > > Gyula
> > >
> > > Bruecke, Christoph  wrote (on Wed, 25 Nov 2015, 13:47):
> > >
> > > > Hi,
> > > >
> > > > the operation “stream.union(stream.map(id))” is equivalent to
> > > > “stream.union(stream)” isn’t it? So it might also duplicate the data.
> > > >
> > > > - Christoph
> > > >
> > > >
> > > > > On 25 Nov 2015, at 11:24, Stephan Ewen  wrote:
> > > > >
> > > > > "stream.union(stream.map(..))" should definitely be possible. Not
> > sure
> > > > why
> > > > > this is not permitted.
> > > > >
> > > > > "stream.union(stream)" would contain each element twice, so should
> > > either
> > > > > give an error or actually union (or duplicate) elements...
> > > > >
> > > > > Stephan
> > > > >
> > > > >
> > > > > On Wed, Nov 25, 2015 at 10:42 AM, Gyula Fóra 
> > > wrote:
> > > > >
> > > > >> Yes, I am not sure if this the intentional behaviour. I think you
> > are
> > > > >> supposed to be able to do the things you described.
> > > > >>
> > > > >> stream.union(stream.map(..)) and things like this are fair
> > operations.
> > > > Also
> > > > >> maybe stream.union(stream) should just give stream instead of an
> > > error.
> > > > >>
> > > > >> Could someone comment on this who knows the reasoning behind the
> > > current
> > > > >> mechanics?
> > > > >>
> > > > >> Gyula
> > > > >>
> > > > >> Vasiliki Kalavri  wrote (on Tue, 24 Nov 2015, 16:46):
> > > > >>
> > > > >>> Hi squirrels,
> > > > >>>
> > > > >>> when porting the gelly streaming code from 0.9 to 0.10 today with
> > > > Paris,
> > > > >> we
> > > > >>> hit an exception in union: "*A DataStream cannot be unioned with
> > > > >> itself*".
> > > > >>>
> > > > >>> The code raising this exception looks like this:
> > > > >>> stream.union(stream.map(...)).
> > > > >>>
> > > > >>> Taking a look into the union code, we see that it's now not
> allowed
> > > to
> > > > >>> union a stream, not only with itself, but with any product of
> > itself.
> > > > >>>
> > > > >>> First, we are wondering, why is that? Does it make building the
> > > stream
> > > > >>> graph easier in some way?
> > > > >>> Second, we might want to give a better error message there, e.g.
> > "*A
> > > > >>> DataStream cannot be unioned with itself or a product of
> itself*",
> > > and
> > > > >>> finally, we should update the docs, which currently state that
> > union
> > > a
> > > > >>> stream with itself is allowed and that "*If you union a data
> stream
> > > > with
> > > > >>> itself you will still only get each element once.*"
> > > > >>>
> > > > >>> Cheers,
> > > > >>> -Vasia.
> > > > >>>
> > > > >>
> > > >
> > > >
> > >
> >
>


Re: Null Pointer Exception in tests but only in COLLECTION mode

2015-11-25 Thread Maximilian Michels
Hi Martin,

Great. Thanks for the fix!

Cheers,
Max

On Tue, Nov 24, 2015 at 7:40 PM, Martin Junghanns
 wrote:
> Hi Max,
>
> fixed in https://github.com/apache/flink/pull/1396
>
> Best,
> Martin
>
>
> On 24.11.2015 13:46, Maximilian Michels wrote:
>>
>> Hi André, hi Martin,
>>
>> This looks very much like a bug. Martin, I would be happy if you
>> opened a JIRA issue.
>>
>> Thanks,
>> Max
>>
>> On Sun, Nov 22, 2015 at 12:27 PM, Martin Junghanns
>>  wrote:
>>>
>>> Hi,
>>>
>>> What he meant was MultipleProgramsTestBase, not FlinkTestBase.
>>>
>>> I debugged this a bit.
>>>
>>> The NPE is thrown in
>>>
>>>
>>> https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/operators/AggregateOperator.java#L296
>>>
>>> since current can be null if the input iterator is empty.
>>>
>>> In Cluster Execution, it is checked that the output of the previous
>>> function
>>> (e.g. Filter) is not empty in:
>>>
>>>
>>> https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/operators/AllGroupReduceDriver.java#L144
>>>
>>> which avoids going into AggregateOperator and getting a NPE.
>>>
>>> However, in Collection Mode, the execution is not grouped (don't know
>>> why,
>>> yet). In
>>>
>>>
>>> https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/api/common/operators/base/GroupReduceOperatorBase.java#L207
>>>
>>> the copied input data is handed over to the aggregate function which
>>> leads
>>> to the NPE.
>>>
>>> Checking inputDataCopy.size() > 0 before calling the aggregate solves the
>>> problem.
>>>
>>> If someone can confirm that this is not a more generic problem, I would
>>> open
>>> an issue and a PR.
>>>
>>> Best,
>>> Martin
>>>
>>>
>>> On 20.11.2015 18:41, André Petermann wrote:


 Hi all,

 during a workflow, a data set may run empty, e.g., because of a join
 without matches.

 We're using FlinkTestBase and found out, that aggregate functions on
 empty data sets work fine in CLUSTER execution mode but cause a Null
 Pointer Exception at AggregateOperator$AggregatingUdf in COLLECTION
 mode.

 Here is the minimal example on 1.0-SNAPSHOT:
 https://gist.github.com/p3et/59a65bab11098dd11054

 Are we doing something wrong, or is this a bug?

 Cheers,
 Andre

>>>
>
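
For reference, the guard Martin describes would look roughly like the sketch below 
(the helper class and its names are illustrative, not the actual GroupReduceOperatorBase 
code; only the size check mirrors the proposed fix):

import java.util.List;

import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.util.Collector;

class EmptyInputGuard {
    // Only call the user function when the collection-mode input is non-empty,
    // mirroring the non-empty check that AllGroupReduceDriver performs in cluster
    // execution, so that AggregateOperator's UDF never sees an empty iterator.
    static <IN, OUT> void reduceIfNonEmpty(
            List<IN> inputDataCopy,
            GroupReduceFunction<IN, OUT> function,
            Collector<OUT> out) throws Exception {

        if (inputDataCopy.size() > 0) {
            function.reduce(inputDataCopy, out);
        }
    }
}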


RE: The null in Flink

2015-11-25 Thread Li, Chengxiang
Hi,
On this mailing list there have been some discussions about null value handling in 
Flink, and I saw several related JIRAs as well (like FLINK-2203, FLINK-2210), but 
unfortunately they got reverted due to immature design, with no further action since 
then. I would like to pick this topic up here, as it's quite an important part of 
data analysis and many features depend on it. Hopefully, through a plenary discussion, 
we can reach an acceptable solution and move forward. Stephan has explained very 
clearly how and why Flink handles "Null values in the Programming Language APIs", so 
I will mainly talk about the second part, "Null values in the high-level (logical) APIs".

1. Why should Flink support NULL value handling in the Table API?
i.  Data sources may miss column values in many cases; without NULL value handling 
in the Table API, users need to write an extra ETL step to handle missing values 
manually.
ii. Some Table API operators generate NULL values on their own, like 
Outer Join/Cube/Rollup/Grouping Set, and so on. NULL value handling in the Table 
API is a prerequisite of these features.

2. The semantics of NULL value handling in the Table API.
Fortunately, there are already mature DBMS standards we can follow for NULL value 
handling; I list several of the semantics here. Note that this may not cover all 
cases, and the semantics may vary between DBMSs, so it should be completely open 
for discussion (a small sketch of #i follows the list).
i.   NULL comparison: in ascending order, NULL is smaller than any other 
value, and NULL == NULL returns false.
ii.  NULL in a GroupBy key: all NULL values are grouped into a single 
group.
iii. NULL in aggregate columns: NULL values are ignored by the aggregation 
function.
iv.  NULL on both sides of a join key: per #i, NULL == NULL returns 
false, so there is no output for NULL join keys.
v.   NULL in a scalar expression: an expression involving NULL (e.g. 1 + 
NULL) returns NULL.
vi.  NULL in a Boolean expression: adds an extra result, UNKNOWN; more on 
the semantics of Boolean expressions in reference #1.
vii. More related function support, like COALESCE, NVL, NANVL, and so on.
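
For illustration, a small, purely schematic sketch of #i (and the UNKNOWN result 
from #vi); this is not Flink code:

final class NullSemanticsSketch {

    // #i: in ascending order, NULL sorts before any non-null value.
    static int compareNullable(Integer a, Integer b) {
        if (a == null && b == null) return 0;
        if (a == null) return -1;
        if (b == null) return 1;
        return a.compareTo(b);
    }

    // #i/#iv/#vi: a comparison with NULL does not yield true (modelled here as a
    // nullable Boolean, with null standing in for UNKNOWN), so a NULL join key
    // never matches another NULL join key.
    static Boolean sqlEquals(Integer a, Integer b) {
        if (a == null || b == null) return null; // UNKNOWN
        return a.equals(b);
    }
}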

3. NULL value storage in the Table API.
  Simply set the Row field value to null. To mark NULL values in serialized binary 
record data, an extra flag is normally used per field to mark whether its value is 
NULL, which would change the data layout of the Row object. So any logic that 
accesses serialized Row data directly needs to be updated to the new data layout, 
for example many methods in RowComparator.
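
To make the flag-per-field idea concrete, a minimal sketch (the Object[] stands in 
for the Row fields; this is not the actual Row serializer or RowComparator code):

import java.io.IOException;

import org.apache.flink.api.common.typeutils.TypeSerializer;
import org.apache.flink.core.memory.DataOutputView;

class NullableRowWriteSketch {
    // Write one null marker per field before the field value, so that readers (and
    // comparators working directly on the binary layout) know whether value bytes follow.
    static void serializeFields(Object[] fields, TypeSerializer<Object>[] serializers,
            DataOutputView target) throws IOException {

        for (int i = 0; i < fields.length; i++) {
            if (fields[i] == null) {
                target.writeBoolean(true);   // null flag set, no value bytes follow
            } else {
                target.writeBoolean(false);  // value bytes follow
                serializers[i].serialize(fields[i], target);
            }
        }
    }
}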

Reference:
1. Nulls: Nothing to worry about: 
http://www.oracle.com/technetwork/issue-archive/2005/05-jul/o45sql-097727.html.
2. Null related functions: 
https://oracle-base.com/articles/misc/null-related-functions

-Original Message-
From: ewenstep...@gmail.com [mailto:ewenstep...@gmail.com] On Behalf Of Stephan 
Ewen
Sent: Thursday, June 18, 2015 8:43 AM
To: dev@flink.apache.org
Subject: Re: The null in Flink

Hi!

I think we actually have two discussions here, both of them important:

--
1) Null values in the Programming Language APIs
--

Fields in composite types may simply be null pointers.

In object types:
  - primitives members are naturally non-nullable
  - all other members are nullable

=> If you want to avoid the overhead of nullability, go with primitive types.

In Tuples, and derives types (Scala case classes):
  - Fields are non-nullable.

=> The reason here is that we initially decided to keep tuples as a very fast 
data type. Because tuples cannot hold primitives in Java/Scala, we would not 
have a way to make fast non-nullable fields. The performance of nullable fields 
affects the key-operations, especially on normalized keys.
We can work around that with some effort, but have not done it so far.

=> In Scala, the Option types is a natural way of elegantly working around that.


--
2) Null values in the high-level (logical) APIs
--

This is mainly what Ted was referring to, if I understood him correctly.

Here, we need to figure out what form of semantic null values to support in the Table 
API and, later, in SQL.

Besides deciding what semantics to follow here in the logical APIs, we need to 
decide what these values convert to/from when switching between 
logical/physical APIs.






On Mon, Jun 15, 2015 at 10:07 AM, Ted Dunning  wrote:

> On Mon, Jun 15, 2015 at 8:45 AM, Maximilian Michels 
> wrote:
>
> > Just to give an idea what null values could cause in Flink:
> DataSet.count()
> > returns the number of elements of all values in a Dataset (null or 
> > not) while #834 would ignore null values and aggregate the DataSet 
> > without
> them.
> >
>
> Compare R's 

Re: could you please add me Contributor list?

2015-11-25 Thread Robert Metzger
I added you as a contributor.

On Wed, Nov 25, 2015 at 7:29 AM, jun aoki  wrote:

> Hi Henry, thank you for helping. My id is jaoki
>
> On Tue, Nov 24, 2015 at 9:47 PM, Henry Saputra 
> wrote:
>
> > Hi Jun,
> >
> > What is your JIRA username?
> >
> > But for now, you can always work on JIRA issue without assigning to
> > yourself. Just add to comment that you are planning to work on it.
> >
> > - Henry
> >
> > On Tue, Nov 24, 2015 at 5:18 PM, jun aoki  wrote:
> > > So that I can assign myself to Flink jiras?
> > >
> > > --
> > > -jun
> >
>
>
>
> --
> -jun
>


Re: Union a data stream with a product of itself

2015-11-25 Thread Gyula Fóra
Yes, I am not sure if this is the intentional behaviour. I think you are
supposed to be able to do the things you described.

stream.union(stream.map(..)) and things like this are fair operations. Also
maybe stream.union(stream) should just give stream instead of an error.

Could someone comment on this who knows the reasoning behind the current
mechanics?

Gyula

Vasiliki Kalavri  wrote (on Tue, 24 Nov 2015, 16:46):

> Hi squirrels,
>
> when porting the gelly streaming code from 0.9 to 0.10 today with Paris, we
> hit an exception in union: "*A DataStream cannot be unioned with itself*".
>
> The code raising this exception looks like this:
> stream.union(stream.map(...)).
>
> Taking a look into the union code, we see that it's now not allowed to
> union a stream, not only with itself, but with any product of itself.
>
> First, we are wondering, why is that? Does it make building the stream
> graph easier in some way?
> Second, we might want to give a better error message there, e.g. "*A
> DataStream cannot be unioned with itself or a product of itself*", and
> finally, we should update the docs, which currently state that union a
> stream with itself is allowed and that "*If you union a data stream with
> itself you will still only get each element once.*"
>
> Cheers,
> -Vasia.
>


Re: Union a data stream with a product of itself

2015-11-25 Thread Stephan Ewen
"stream.union(stream.map(..))" should definitely be possible. Not sure why
this is not permitted.

"stream.union(stream)" would contain each element twice, so should either
give an error or actually union (or duplicate) elements...

Stephan


On Wed, Nov 25, 2015 at 10:42 AM, Gyula Fóra  wrote:

> Yes, I am not sure if this the intentional behaviour. I think you are
> supposed to be able to do the things you described.
>
> stream.union(stream.map(..)) and things like this are fair operations. Also
> maybe stream.union(stream) should just give stream instead of an error.
>
> Could someone comment on this who knows the reasoning behind the current
> mechanics?
>
> Gyula
>
> Vasiliki Kalavri  wrote (on Tue, 24 Nov 2015, 16:46):
>
> > Hi squirrels,
> >
> > when porting the gelly streaming code from 0.9 to 0.10 today with Paris,
> we
> > hit an exception in union: "*A DataStream cannot be unioned with
> itself*".
> >
> > The code raising this exception looks like this:
> > stream.union(stream.map(...)).
> >
> > Taking a look into the union code, we see that it's now not allowed to
> > union a stream, not only with itself, but with any product of itself.
> >
> > First, we are wondering, why is that? Does it make building the stream
> > graph easier in some way?
> > Second, we might want to give a better error message there, e.g. "*A
> > DataStream cannot be unioned with itself or a product of itself*", and
> > finally, we should update the docs, which currently state that union a
> > stream with itself is allowed and that "*If you union a data stream with
> > itself you will still only get each element once.*"
> >
> > Cheers,
> > -Vasia.
> >
>


Re: The null in Flink

2015-11-25 Thread Timo Walther

Hi Chengxiang,

I totally agree that the Table API should fully support NULL values. The 
Table API is a logical API and therefore we should be as close to ANSI 
SQL as possible. Rows need to be nullable in the near future.


2. i, ii, iii and iv sound reasonable. But v, vi and vii sound too much 
like SQL magic. I think all other SQL magic (DBMS specific corner cases) 
should be handled by the SQL API on top of the Table API.


Regards,
Timo


On 25.11.2015 11:31, Li, Chengxiang wrote:

Hi
In this mail list, there are some discussions about null value handling in Flink, and I saw several 
related JIRAs as well(like FLINK-2203, FLINK-2210), but unfortunately, got reverted due to immature 
design, and no further action since then. I would like to pick this topic up here, as it's quite an 
important part of data analysis and many features depend on it. Hopefully, through a plenary 
discussion, we can generate an acceptable solution and move forward. Stephan has explained very 
clearly about how and why Flink handle "Null values in the Programming Language APIs", so 
I mainly talk about the second part of "Null values in the high-level (logical) APIs ".

1. Why should Flink support Null values handling in Table API?
i.  Data source may miss column value in many cases, if no Null values 
handling in Table API, user need to write an extra ETL to handle missing values 
manually.
ii. Some Table API operators generate Null values on their own, like 
Outer Join/Cube/Rollup/Grouping Set, and so on. Null values handling in Table 
API is the prerequisite of these features.

2. The semantic of Null value handling in Table API.
Fortunately, there are already mature DBMS  standards we can follow for Null 
value handling, I list several semantic of Null value handling here. To be 
noted that, this may not cover all the cases, and the semantics may vary in 
different DBMSs, so it should totally open to discuss.
I,  NULL compare. In ascending order, NULL is smaller than any other 
value, and NULL == NULL return false.
ii. NULL exists in GroupBy Key, all NULL values are grouped as a single 
group.
iii. NULL exists in Aggregate columns, ignore NULL in aggregation 
function.
 iv. NULL exists in both side Join key, refer to #i, NULL == 
NULL return false, no output for NULL Join key.
 v.  NULL in Scalar expression, expression within NULL(eg. 1 + 
NULL) return NULL.
 vi. NULL in Boolean expression, add an extra result: UNKNOWN, 
more semantic for Boolean expression in reference #1.
 vii. More related function support, like COALESCE, NVL, NANVL, 
and so on.

3. NULL value storage in Table API.
   Just set null to Row field value. To mark NULL value in serialized binary 
record data, normally it use extra flag for each field to mark whether its 
value is NULL, which would change the data layout of Row object. So any logic 
that access serialized Row data directly should updated to sync with new data 
layout, for example, many methods in RowComparator.

Reference:
1. Nulls: Nothing to worry about: 
http://www.oracle.com/technetwork/issue-archive/2005/05-jul/o45sql-097727.html.
2. Null related functions: 
https://oracle-base.com/articles/misc/null-related-functions

-Original Message-
From: ewenstep...@gmail.com [mailto:ewenstep...@gmail.com] On Behalf Of Stephan 
Ewen
Sent: Thursday, June 18, 2015 8:43 AM
To: dev@flink.apache.org
Subject: Re: The null in Flink

Hi!

I think we actually have two discussions here, both of them important:

--
1) Null values in the Programming Language APIs
--

Fields in composite types may simply be null pointers.

In object types:
   - primitives members are naturally non-nullable
   - all other members are nullable

=> If you want to avoid the overhead of nullability, go with primitive types.

In Tuples, and derives types (Scala case classes):
   - Fields are non-nullable.

=> The reason here is that we initially decided to keep tuples as a very fast 
data type. Because tuples cannot hold primitives in Java/Scala, we would not have 
a way to make fast non-nullable fields. The performance of nullable fields affects 
the key-operations, especially on normalized keys.
We can work around that with some effort, but have not done it so far.

=> In Scala, the Option types is a natural way of elegantly working around that.


--
2) Null values in the high-level (logical) APIs
--

This is mainly what Ted was referring to, if I understood him correctly.

Here, we need to figure out what form of semantic null values to support in the Table 
API and, later, in SQL.

Besides deciding what semantics to follow here in the logical APIs, we need to 
decide what these values convert to/from when switching between logical/physical APIs.

[jira] [Created] (FLINK-3077) Add "version" command to CliFrontend for showing the version of the installation

2015-11-25 Thread Robert Metzger (JIRA)
Robert Metzger created FLINK-3077:
-

 Summary: Add "version" command to CliFrontend for showing the 
version of the installation
 Key: FLINK-3077
 URL: https://issues.apache.org/jira/browse/FLINK-3077
 Project: Flink
  Issue Type: Improvement
  Components: Command-line client
Reporter: Robert Metzger
 Fix For: 1.0.0


I have the bin directory of Flink in my $PATH variable, so I can just do "flink 
run" on the command line for executing stuff.
However, I have multiple Flink versions locally and it's hard to find out which 
installation the shell is picking up in the end.

Adding a simple "version" command would resolve that issue, and I consider it 
helpful in general.
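
For illustration, the intended usage would be something along these lines (the exact 
output format is hypothetical):

{code}
$ flink version
Version: 0.10.1, Commit ID: 2e9b2316
{code}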



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-3078) JobManager does not shutdown when checkpointed jobs are running

2015-11-25 Thread Robert Metzger (JIRA)
Robert Metzger created FLINK-3078:
-

 Summary: JobManager does not shutdown when checkpointed jobs are 
running
 Key: FLINK-3078
 URL: https://issues.apache.org/jira/browse/FLINK-3078
 Project: Flink
  Issue Type: Bug
  Components: JobManager
Affects Versions: 0.10.0, 0.10.1
Reporter: Robert Metzger


While testing the 0.10.1 release, I found that the JobManager does not shut down 
when I stop it while a streaming job is running.

It seems that the checkpoint coordinator and the execution graph are still 
logging, even though the JobManager actor system and other services have been 
shut down.

This is a log file of an affected JobManager: 
https://gist.github.com/rmetzger/a1532c18eb7081977cee

{code}
11:58:04,406 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph   
 - Flat Map -> Sink: Unnamed (10/10) (c6544ca6d88e2d1acdec5c838d5fce06) 
switched from CANCELING to FAILED
11:58:04,406 DEBUG org.apache.flink.runtime.executiongraph.ExecutionGraph   
 - Kafka Consumer Topology switched from FAILING to RESTARTING.
11:58:04,407 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph   
 - Delaying retry of job execution for 10 ms ...
11:58:04,417 INFO  org.apache.flink.runtime.blob.BlobServer 
 - Stopped BLOB server at 0.0.0.0:44904
11:58:04,421 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator
 - Shutting down remote daemon.
11:58:04,422 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator
 - Remote daemon shut down; proceeding with flushing remote transports.
11:58:04,446 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator
 - Remoting shut down.
11:58:04,473 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
 - Removing web root dir /tmp/flink-web-2039bed3-d9f9-4950-83ab-6fb70f7fc302
11:58:04,590 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator
 - Triggering checkpoint 66 @ 1448452684590
11:58:04,590 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator
 - Checkpoint triggering task Source: Custom Source (1/10) is not being 
executed at the moment. Aborting checkpoint.
11:58:05,091 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator
 - Triggering checkpoint 67 @ 1448452685091
11:58:05,091 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator
 - Checkpoint triggering task Source: Custom Source (1/10) is not being 
executed at the moment. Aborting checkpoint.
11:58:05,590 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator
 - Triggering checkpoint 68 @ 1448452685590
11:58:05,590 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator
 - Checkpoint triggering task Source: Custom Source (1/10) is not being 
executed at the moment. Aborting checkpoint.
11:58:06,090 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator
 - Triggering checkpoint 69 @ 1448452686090
11:58:06,091 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator
 - Checkpoint triggering task Source: Custom Source (1/10) is not being 
executed at the moment. Aborting checkpoint.
11:58:06,590 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator
 - Triggering checkpoint 70 @ 1448452686590
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [VOTE] Release Apache Flink 0.10.1 (release-0.10.0-rc1)

2015-11-25 Thread Robert Metzger
+1

- Build a maven project with the staging repository
- started Flink on YARN on a CDH 5.4.5 / Hadoop 2.6.0-cdh5.4.5 cluster with
YARN and HDFS HA
- ran some kafka (0.8.2.0) read / write experiments
- job cancellation with yarn is working ;)

I found the following issue while testing:
https://issues.apache.org/jira/browse/FLINK-3078 but it was already in
0.10.0 and it's not super critical because the JobManager container will be
killed by YARN after a few minutes.


I'll extend the vote until tomorrow Thursday, November 26.


On Tue, Nov 24, 2015 at 1:54 PM, Stephan Ewen  wrote:

> @Gyula: I think it affects users, so should definitely be fixed very soon
> (either 0.10.1 or 0.10.2)
>
> Still checking whether Robert's current version fix solves it now, or
> not...
>
> On Tue, Nov 24, 2015 at 1:46 PM, Vyacheslav Zholudev <
> vyacheslav.zholu...@gmail.com> wrote:
>
> > I can confirm that the build works fine when increasing a max number of
> > opened files. Sorry for confusion.
> >
> >
> >
> > --
> > View this message in context:
> >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-Apache-Flink-0-10-1-release-0-10-0-rc1-tp9296p9327.html
> > Sent from the Apache Flink Mailing List archive. mailing list archive at
> > Nabble.com.
> >
>


Re: Union a data stream with a product of itself

2015-11-25 Thread Gyula Fóra
Well it kind of depends on which definition of union we are using. If this
is a union in a set theoretical way we can argue that the union of a stream
with itself should be the same stream because it contains exactly the same
elements with the same timestamps and lineage.

On the other hand stream and stream.map(id) are not exactly the same as
they might have elements with different order (the lineage differs).

So I wouldn't say that any self-union semantics is the only possible one.

Gyula

Bruecke, Christoph  wrote (on Wed, 25 Nov 2015, 13:47):

> Hi,
>
> the operation “stream.union(stream.map(id))” is equivalent to
> “stream.union(stream)” isn’t it? So it might also duplicate the data.
>
> - Christoph
>
>
> > On 25 Nov 2015, at 11:24, Stephan Ewen  wrote:
> >
> > "stream.union(stream.map(..))" should definitely be possible. Not sure
> why
> > this is not permitted.
> >
> > "stream.union(stream)" would contain each element twice, so should either
> > give an error or actually union (or duplicate) elements...
> >
> > Stephan
> >
> >
> > On Wed, Nov 25, 2015 at 10:42 AM, Gyula Fóra  wrote:
> >
> >> Yes, I am not sure if this the intentional behaviour. I think you are
> >> supposed to be able to do the things you described.
> >>
> >> stream.union(stream.map(..)) and things like this are fair operations.
> Also
> >> maybe stream.union(stream) should just give stream instead of an error.
> >>
> >> Could someone comment on this who knows the reasoning behind the current
> >> mechanics?
> >>
> >> Gyula
> >>
> >> Vasiliki Kalavri  wrote (on Tue, 24 Nov 2015, 16:46):
> >>
> >>> Hi squirrels,
> >>>
> >>> when porting the gelly streaming code from 0.9 to 0.10 today with
> Paris,
> >> we
> >>> hit an exception in union: "*A DataStream cannot be unioned with
> >> itself*".
> >>>
> >>> The code raising this exception looks like this:
> >>> stream.union(stream.map(...)).
> >>>
> >>> Taking a look into the union code, we see that it's now not allowed to
> >>> union a stream, not only with itself, but with any product of itself.
> >>>
> >>> First, we are wondering, why is that? Does it make building the stream
> >>> graph easier in some way?
> >>> Second, we might want to give a better error message there, e.g. "*A
> >>> DataStream cannot be unioned with itself or a product of itself*", and
> >>> finally, we should update the docs, which currently state that union a
> >>> stream with itself is allowed and that "*If you union a data stream
> with
> >>> itself you will still only get each element once.*"
> >>>
> >>> Cheers,
> >>> -Vasia.
> >>>
> >>
>
>


Re: Union a data stream with a product of itself

2015-11-25 Thread Vasiliki Kalavri
So, do we all agree that the current behavior is not correct? Shall I open
a JIRA about this?

On 25 November 2015 at 13:58, Gyula Fóra  wrote:

> Well it kind of depends on what definition of union are we using. If this
> is a union in a set theoretical way we can argue that the union of a stream
> with itself should be the same stream because it contains exactly the same
> elements with the same timestamps and lineage.
>
> On the other hand stream and stream.map(id) are not exactly the same as
> they might have elements with different order (the lineage differs).
>
> So I wouldnt say that any self-union semantics is the only possible one.
>
> Gyula
>
> Bruecke, Christoph  wrote (on Wed, 25 Nov 2015, 13:47):
>
> > Hi,
> >
> > the operation “stream.union(stream.map(id))” is equivalent to
> > “stream.union(stream)” isn’t it? So it might also duplicate the data.
> >
> > - Christoph
> >
> >
> > > On 25 Nov 2015, at 11:24, Stephan Ewen  wrote:
> > >
> > > "stream.union(stream.map(..))" should definitely be possible. Not sure
> > why
> > > this is not permitted.
> > >
> > > "stream.union(stream)" would contain each element twice, so should
> either
> > > give an error or actually union (or duplicate) elements...
> > >
> > > Stephan
> > >
> > >
> > > On Wed, Nov 25, 2015 at 10:42 AM, Gyula Fóra 
> wrote:
> > >
> > >> Yes, I am not sure if this the intentional behaviour. I think you are
> > >> supposed to be able to do the things you described.
> > >>
> > >> stream.union(stream.map(..)) and things like this are fair operations.
> > Also
> > >> maybe stream.union(stream) should just give stream instead of an
> error.
> > >>
> > >> Could someone comment on this who knows the reasoning behind the
> current
> > >> mechanics?
> > >>
> > >> Gyula
> > >>
> > >> Vasiliki Kalavri  wrote (on Tue, 24 Nov 2015, 16:46):
> > >>
> > >>> Hi squirrels,
> > >>>
> > >>> when porting the gelly streaming code from 0.9 to 0.10 today with
> > Paris,
> > >> we
> > >>> hit an exception in union: "*A DataStream cannot be unioned with
> > >> itself*".
> > >>>
> > >>> The code raising this exception looks like this:
> > >>> stream.union(stream.map(...)).
> > >>>
> > >>> Taking a look into the union code, we see that it's now not allowed
> to
> > >>> union a stream, not only with itself, but with any product of itself.
> > >>>
> > >>> First, we are wondering, why is that? Does it make building the
> stream
> > >>> graph easier in some way?
> > >>> Second, we might want to give a better error message there, e.g. "*A
> > >>> DataStream cannot be unioned with itself or a product of itself*",
> and
> > >>> finally, we should update the docs, which currently state that union
> a
> > >>> stream with itself is allowed and that "*If you union a data stream
> > with
> > >>> itself you will still only get each element once.*"
> > >>>
> > >>> Cheers,
> > >>> -Vasia.
> > >>>
> > >>
> >
> >
>


Re: [VOTE] Release Apache Flink 0.10.1 (release-0.10.0-rc1)

2015-11-25 Thread Stephan Ewen
@Till I think the avro test data file is okay, the "no binaries" policy
refers to binary executables, as far as I know.

On Wed, Nov 25, 2015 at 2:54 PM, Till Rohrmann 
wrote:

> Checked checksums for src release and Hadoop 2.7 Scala 2.10 release
>
> Checked binaries in source release
> - contains ./flink-staging/flink-avro/src/test/resources/testdata.avro
>
> License
> - no new files added which are relevant for licensing
>
> Build Flink and run tests from source release for Hadoop 2.5.1
>
> Checked that log files don't contain exceptions and that out files are
> empty
>
> Run all examples with Hadoop 2.7 Scala 2.10 binaries via FliRTT tool on 4
> node standalone cluster and YARN cluster
>
> Tested planVisualizer
>
> Tested flink command line client
> - tested info command
> - tested -p option
>
> Tested cluster HA in standalone mode => working
>
> Tested cluster HA on YARN (2.7.1) => working
>
> Except for the avro testdata file which is contained in the source release,
> I didn't find anything.
>
> +1 for releasing and removing the testdata file for the next release.
>
> On Wed, Nov 25, 2015 at 2:33 PM, Robert Metzger 
> wrote:
>
> > +1
> >
> > - Build a maven project with the staging repository
> > - started Flink on YARN on a CDH 5.4.5 / Hadoop 2.6.0-cdh5.4.5 cluster
> with
> > YARN and HDFS HA
> > - ran some kafka (0.8.2.0) read / write experiments
> > - job cancellation with yarn is working ;)
> >
> > I found the following issue while testing:
> > https://issues.apache.org/jira/browse/FLINK-3078 but it was already in
> > 0.10.0 and its not super critical bc the JobManager container will be
> > killed by YARN after a few minutes.
> >
> >
> > I'll extend the vote until tomorrow Thursday, November 26.
> >
> >
> > On Tue, Nov 24, 2015 at 1:54 PM, Stephan Ewen  wrote:
> >
> > > @Gyula: I think it affects users, so should definitely be fixed very
> soon
> > > (either 0.10.1 or 0.10.2)
> > >
> > > Still checking whether Robert's current version fix solves it now, or
> > > not...
> > >
> > > On Tue, Nov 24, 2015 at 1:46 PM, Vyacheslav Zholudev <
> > > vyacheslav.zholu...@gmail.com> wrote:
> > >
> > > > I can confirm that the build works fine when increasing a max number
> of
> > > > opened files. Sorry for confusion.
> > > >
> > > >
> > > >
> > > > --
> > > > View this message in context:
> > > >
> > >
> >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-Apache-Flink-0-10-1-release-0-10-0-rc1-tp9296p9327.html
> > > > Sent from the Apache Flink Mailing List archive. mailing list archive
> > at
> > > > Nabble.com.
> > > >
> > >
> >
>


Re: Union a data stream with a product of itself

2015-11-25 Thread Bruecke, Christoph
Hi,

the operation “stream.union(stream.map(id))” is equivalent to 
“stream.union(stream)” isn’t it? So it might also duplicate the data.

- Christoph


> On 25 Nov 2015, at 11:24, Stephan Ewen  wrote:
> 
> "stream.union(stream.map(..))" should definitely be possible. Not sure why
> this is not permitted.
> 
> "stream.union(stream)" would contain each element twice, so should either
> give an error or actually union (or duplicate) elements...
> 
> Stephan
> 
> 
> On Wed, Nov 25, 2015 at 10:42 AM, Gyula Fóra  wrote:
> 
>> Yes, I am not sure if this the intentional behaviour. I think you are
>> supposed to be able to do the things you described.
>> 
>> stream.union(stream.map(..)) and things like this are fair operations. Also
>> maybe stream.union(stream) should just give stream instead of an error.
>> 
>> Could someone comment on this who knows the reasoning behind the current
>> mechanics?
>> 
>> Gyula
>> 
>> Vasiliki Kalavri  wrote (on Tue, 24 Nov 2015, 16:46):
>> 
>>> Hi squirrels,
>>> 
>>> when porting the gelly streaming code from 0.9 to 0.10 today with Paris,
>> we
>>> hit an exception in union: "*A DataStream cannot be unioned with
>> itself*".
>>> 
>>> The code raising this exception looks like this:
>>> stream.union(stream.map(...)).
>>> 
>>> Taking a look into the union code, we see that it's now not allowed to
>>> union a stream, not only with itself, but with any product of itself.
>>> 
>>> First, we are wondering, why is that? Does it make building the stream
>>> graph easier in some way?
>>> Second, we might want to give a better error message there, e.g. "*A
>>> DataStream cannot be unioned with itself or a product of itself*", and
>>> finally, we should update the docs, which currently state that union a
>>> stream with itself is allowed and that "*If you union a data stream with
>>> itself you will still only get each element once.*"
>>> 
>>> Cheers,
>>> -Vasia.
>>> 
>> 



Re: [VOTE] Release Apache Flink 0.10.1 (release-0.10.0-rc1)

2015-11-25 Thread Stephan Ewen
+1

 - License and Notice are good
 - ran all tests (including manual tests) work for hadoop 2.3.0 - Scala 2.10
 - ran all tests for hadoop 2.7.0 - Scala 2.11
 - ran all examples, several on larger external data
 - checked web frontend
 - checked quickstart archetypes


On Tue, Nov 24, 2015 at 1:54 PM, Stephan Ewen  wrote:

> @Gyula: I think it affects users, so should definitely be fixed very soon
> (either 0.10.1 or 0.10.2)
>
> Still checking whether Robert's current version fix solves it now, or
> not...
>
> On Tue, Nov 24, 2015 at 1:46 PM, Vyacheslav Zholudev <
> vyacheslav.zholu...@gmail.com> wrote:
>
>> I can confirm that the build works fine when increasing a max number of
>> opened files. Sorry for confusion.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-Apache-Flink-0-10-1-release-0-10-0-rc1-tp9296p9327.html
>> Sent from the Apache Flink Mailing List archive. mailing list archive at
>> Nabble.com.
>>
>
>


Re: [VOTE] Release Apache Flink 0.10.1 (release-0.10.0-rc1)

2015-11-25 Thread Till Rohrmann
Checked checksums for src release and Hadoop 2.7 Scala 2.10 release

Checked binaries in source release
- contains ./flink-staging/flink-avro/src/test/resources/testdata.avro

License
- no new files added which are relevant for licensing

Build Flink and run tests from source release for Hadoop 2.5.1

Checked that log files don't contain exceptions and that out files are
empty

Run all examples with Hadoop 2.7 Scala 2.10 binaries via FliRTT tool on 4
node standalone cluster and YARN cluster

Tested planVisualizer

Tested flink command line client
- tested info command
- tested -p option

Tested cluster HA in standalone mode => working

Tested cluster HA on YARN (2.7.1) => working

Except for the avro testdata file which is contained in the source release,
I didn't find anything.

+1 for releasing and removing the testdata file for the next release.

On Wed, Nov 25, 2015 at 2:33 PM, Robert Metzger  wrote:

> +1
>
> - Build a maven project with the staging repository
> - started Flink on YARN on a CDH 5.4.5 / Hadoop 2.6.0-cdh5.4.5 cluster with
> YARN and HDFS HA
> - ran some kafka (0.8.2.0) read / write experiments
> - job cancellation with yarn is working ;)
>
> I found the following issue while testing:
> https://issues.apache.org/jira/browse/FLINK-3078 but it was already in
> 0.10.0 and its not super critical bc the JobManager container will be
> killed by YARN after a few minutes.
>
>
> I'll extend the vote until tomorrow Thursday, November 26.
>
>
> On Tue, Nov 24, 2015 at 1:54 PM, Stephan Ewen  wrote:
>
> > @Gyula: I think it affects users, so should definitely be fixed very soon
> > (either 0.10.1 or 0.10.2)
> >
> > Still checking whether Robert's current version fix solves it now, or
> > not...
> >
> > On Tue, Nov 24, 2015 at 1:46 PM, Vyacheslav Zholudev <
> > vyacheslav.zholu...@gmail.com> wrote:
> >
> > > I can confirm that the build works fine when increasing a max number of
> > > opened files. Sorry for confusion.
> > >
> > >
> > >
> > > --
> > > View this message in context:
> > >
> >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-Apache-Flink-0-10-1-release-0-10-0-rc1-tp9296p9327.html
> > > Sent from the Apache Flink Mailing List archive. mailing list archive
> at
> > > Nabble.com.
> > >
> >
>


Re: [VOTE] Release Apache Flink 0.10.1 (release-0.10.0-rc1)

2015-11-25 Thread Till Rohrmann
Alright, then I withdraw my remark concerning testdata.avro.

On Wed, Nov 25, 2015 at 2:56 PM, Stephan Ewen  wrote:

> @Till I think the avro test data file is okay, the "no binaries" policy
> refers to binary executables, as far as I know.
>
> On Wed, Nov 25, 2015 at 2:54 PM, Till Rohrmann 
> wrote:
>
> > Checked checksums for src release and Hadoop 2.7 Scala 2.10 release
> >
> > Checked binaries in source release
> > - contains ./flink-staging/flink-avro/src/test/resources/testdata.avro
> >
> > License
> > - no new files added which are relevant for licensing
> >
> > Build Flink and run tests from source release for Hadoop 2.5.1
> >
> > Checked that log files don't contain exceptions and that out files are
> > empty
> >
> > Run all examples with Hadoop 2.7 Scala 2.10 binaries via FliRTT tool on 4
> > node standalone cluster and YARN cluster
> >
> > Tested planVisualizer
> >
> > Tested flink command line client
> > - tested info command
> > - tested -p option
> >
> > Tested cluster HA in standalone mode => working
> >
> > Tested cluster HA on YARN (2.7.1) => working
> >
> > Except for the avro testdata file which is contained in the source
> release,
> > I didn't find anything.
> >
> > +1 for releasing and removing the testdata file for the next release.
> >
> > On Wed, Nov 25, 2015 at 2:33 PM, Robert Metzger 
> > wrote:
> >
> > > +1
> > >
> > > - Build a maven project with the staging repository
> > > - started Flink on YARN on a CDH 5.4.5 / Hadoop 2.6.0-cdh5.4.5 cluster
> > with
> > > YARN and HDFS HA
> > > - ran some kafka (0.8.2.0) read / write experiments
> > > - job cancellation with yarn is working ;)
> > >
> > > I found the following issue while testing:
> > > https://issues.apache.org/jira/browse/FLINK-3078 but it was already in
> > > 0.10.0 and its not super critical bc the JobManager container will be
> > > killed by YARN after a few minutes.
> > >
> > >
> > > I'll extend the vote until tomorrow Thursday, November 26.
> > >
> > >
> > > On Tue, Nov 24, 2015 at 1:54 PM, Stephan Ewen 
> wrote:
> > >
> > > > @Gyula: I think it affects users, so should definitely be fixed very
> > soon
> > > > (either 0.10.1 or 0.10.2)
> > > >
> > > > Still checking whether Robert's current version fix solves it now, or
> > > > not...
> > > >
> > > > On Tue, Nov 24, 2015 at 1:46 PM, Vyacheslav Zholudev <
> > > > vyacheslav.zholu...@gmail.com> wrote:
> > > >
> > > > > I can confirm that the build works fine when increasing a max
> number
> > of
> > > > > opened files. Sorry for confusion.
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > View this message in context:
> > > > >
> > > >
> > >
> >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-Apache-Flink-0-10-1-release-0-10-0-rc1-tp9296p9327.html
> > > > > Sent from the Apache Flink Mailing List archive. mailing list
> archive
> > > at
> > > > > Nabble.com.
> > > > >
> > > >
> > >
> >
>


RE: The null in Flink

2015-11-25 Thread Li, Chengxiang
Thanks, Timo.
We may move the NULL-related function support to the SQL API, but scalar 
expressions and Boolean expressions are already supported in the Table API; 
without NULL value handling, queries with scalar or Boolean expressions 
would fail when they encounter a NULL value.

Thanks
Chengxiang 

-Original Message-
From: Timo Walther [mailto:twal...@apache.org] 
Sent: Wednesday, November 25, 2015 7:33 PM
To: dev@flink.apache.org
Subject: Re: The null in Flink

Hi Chengxiang,

I totally agree that the Table API should fully support NULL values. The Table 
API is a logical API and therefore we should be as close to ANSI SQL as 
possible. Rows need to be nullable in the near future.

2. i, ii, iii and iv sound reasonable. But v, vi and vii sound to much like SQL 
magic. I think all other SQL magic (DBMS specific corner cases) should be 
handled by the SQL API on top of the Table API.

Regards,
Timo


On 25.11.2015 11:31, Li, Chengxiang wrote:
> Hi
> In this mail list, there are some discussions about null value handling in 
> Flink, and I saw several related JIRAs as well(like FLINK-2203, FLINK-2210), 
> but unfortunately, got reverted due to immature design, and no further action 
> since then. I would like to pick this topic up here, as it's quite an 
> important part of data analysis and many features depend on it. Hopefully, 
> through a plenary discussion, we can generate an acceptable solution and move 
> forward. Stephan has explained very clearly about how and why Flink handle 
> "Null values in the Programming Language APIs", so I mainly talk about the 
> second part of "Null values in the high-level (logical) APIs ".
>
> 1. Why should Flink support Null values handling in Table API?
>   i.  Data source may miss column value in many cases, if no Null values 
> handling in Table API, user need to write an extra ETL to handle missing 
> values manually.
>   ii. Some Table API operators generate Null values on their own, like 
> Outer Join/Cube/Rollup/Grouping Set, and so on. Null values handling in Table 
> API is the prerequisite of these features.
>
> 2. The semantic of Null value handling in Table API.
> Fortunately, there are already mature DBMS  standards we can follow for Null 
> value handling, I list several semantic of Null value handling here. To be 
> noted that, this may not cover all the cases, and the semantics may vary in 
> different DBMSs, so it should totally open to discuss.
>   I,  NULL compare. In ascending order, NULL is smaller than any other 
> value, and NULL == NULL return false.
>   ii. NULL exists in GroupBy Key, all NULL values are grouped as a single 
> group.
>   iii. NULL exists in Aggregate columns, ignore NULL in aggregation 
> function.
>  iv. NULL exists in both side Join key, refer to #i, NULL == 
> NULL return false, no output for NULL Join key.
>  v.  NULL in Scalar expression, expression within NULL(eg. 1 
> + NULL) return NULL.
>  vi. NULL in Boolean expression, add an extra result: 
> UNKNOWN, more semantic for Boolean expression in reference #1.
>  vii. More related function support, like COALESCE, NVL, 
> NANVL, and so on.
>
> 3. NULL value storage in Table API.
>Just set null to Row field value. To mark NULL value in serialized binary 
> record data, normally it use extra flag for each field to mark whether its 
> value is NULL, which would change the data layout of Row object. So any logic 
> that access serialized Row data directly should updated to sync with new data 
> layout, for example, many methods in RowComparator.
>
> Reference:
> 1. Nulls: Nothing to worry about: 
> http://www.oracle.com/technetwork/issue-archive/2005/05-jul/o45sql-097727.html.
> 2. Null related functions: 
> https://oracle-base.com/articles/misc/null-related-functions
>
> -Original Message-
> From: ewenstep...@gmail.com [mailto:ewenstep...@gmail.com] On Behalf 
> Of Stephan Ewen
> Sent: Thursday, June 18, 2015 8:43 AM
> To: dev@flink.apache.org
> Subject: Re: The null in Flink
>
> Hi!
>
> I think we actually have two discussions here, both of them important:
>
> --
> 1) Null values in the Programming Language APIs
> --
>
> Fields in composite types may simply be null pointers.
>
> In object types:
>- primitives members are naturally non-nullable
>- all other members are nullable
>
> => If you want to avoid the overhead of nullability, go with primitive types.
>
> In Tuples, and derives types (Scala case classes):
>- Fields are non-nullable.
>
> => The reason here is that we initially decided to keep tuples as a very fast 
> data type. Because tuples cannot hold primitives in Java/Scala, we would not 
> have a way to make fast non-nullable fields. The performance of nullable 
> fields affects the key-operations, especially on