Re: Parallelizing ExecutionConfig.fromCollection

2016-04-25 Thread Greg Hogan
Hi Till,

I appreciate the detailed explanation. My specific case has been with the
graph generators. I think it is possible to implement some random sources
using SplittableIterator rather than building a Collection, so it might be
best to rework the graph generator API to better fit the Flink model. For
LCGs we can simply build a skip-ahead table.
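
The skip-ahead trick works because one LCG step is an affine map x -> a*x + c (mod 2^64); composing k steps yields another affine map whose coefficients can be built in O(log k) by repeated doubling. A minimal sketch of that idea (illustrative plain Java, not Gelly code; the constants are Knuth's MMIX parameters):

```java
public class LcgSkipAhead {
    // Compute (aK, cK) such that x_{n+k} = aK * x_n + cK (mod 2^64, via long
    // overflow), in O(log k) by binary decomposition of k.
    static long[] skipParams(long a, long c, long k) {
        long aK = 1, cK = 0;     // accumulated transform (identity)
        long aCur = a, cCur = c; // transform for the current power-of-two step
        while (k > 0) {
            if ((k & 1) == 1) {
                // apply the current step after the accumulated transform:
                // aCur*(aK*x + cK) + cCur
                aK = aK * aCur;
                cK = cK * aCur + cCur;
            }
            // double the step: applying (aCur, cCur) twice
            cCur = cCur * (aCur + 1);
            aCur = aCur * aCur;
            k >>= 1;
        }
        return new long[]{aK, cK};
    }

    public static void main(String[] args) {
        long a = 6364136223846793005L, c = 1442695040888963407L;
        long x = 42L;
        for (int i = 0; i < 10; i++) x = a * x + c; // 10 single steps
        long[] p = skipParams(a, c, 10);            // one 10-step jump
        System.out.println(x == p[0] * 42L + p[1]); // prints true
    }
}
```

A skip-ahead table for parallel sources would precompute such (aK, cK) pairs, one per subtask offset.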

Greg

On Mon, Apr 25, 2016 at 10:05 AM, Till Rohrmann 
wrote:

> Hi Greg,
>
> I think we haven't discussed the opportunity for a parallelized collection
> input format, yet. Thanks for bringing this up.
>
> I think it should be possible to implement a generic parallel collection
> input format. However, I have two questions here:
>
> 1. Is it really a problem for users that their job exceeds the akka frame
> size limit when using the collection input format? The collection input
> format should be used primarily for testing and, thus, the data should be
> rather small.
>
> 2. Which message is causing the frame size problem? If it is the task
> deployment descriptor, then a parallelized collection input format which
> works with input splits can solve the problem. If the problem is rather the
> `SubmitJob` message, then we have to tackle the problem differently. The
> reason is that the input splits are only created on the `JobManager`.
> Before that, the collection is simply written into the task config of the
> data source `JobVertex`, because we don't know the number of subtasks yet. In
> the latter case, which can also be caused by large closure objects, we
> should send the job via the blob manager to the `JobManager` to solve the
> problem.
>
> Cheers,
> Till
>
> On Mon, Apr 25, 2016 at 3:45 PM, Greg Hogan  wrote:
>
> > Hi,
> >
> > CollectionInputFormat currently enforces a parallelism of 1 by
> implementing
> > NonParallelInput and serializing the entire Collection. If my
> understanding
> > is correct, this serialized InputFormat is often the cause of a new job
> > exceeding the akka message size limit.
> >
> > As an alternative the Collection elements could be serialized into
> multiple
> > InputSplits. Has this idea been considered and rejected?
> >
> > Thanks,
> > Greg
> >
>


[jira] [Created] (FLINK-3811) Refactor ExecutionEnvironment in TableEnvironment

2016-04-25 Thread Fabian Hueske (JIRA)
Fabian Hueske created FLINK-3811:


 Summary: Refactor ExecutionEnvironment in TableEnvironment
 Key: FLINK-3811
 URL: https://issues.apache.org/jira/browse/FLINK-3811
 Project: Flink
  Issue Type: Improvement
  Components: Table API
Reporter: Fabian Hueske
Assignee: Fabian Hueske
Priority: Minor
 Fix For: 1.1.0


Currently, the Scala BatchTableEnvironment has a reference to a Scala 
ExecutionEnvironment and the Java BatchTableEnvironment uses the Java 
ExecutionEnvironment. The same applies to their streaming counterparts.
This requires special Java / Scala implementations, for instance, to create
new data sources.

I propose to refactor the TableEnvironments such that only the Java execution 
environments for batch and streaming are used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Pull request failed Travis...what's next?

2016-04-25 Thread Fabian Hueske
Hi Sourigna,

usually a PR is picked up and reviewed by somebody from the community and
eventually merged by a committer.
Sometimes it takes a few days, but if nobody reacts it helps to ping just
like you did.
I'll have a look at your PR tomorrow.

Thanks, Fabian

2016-04-25 18:01 GMT+02:00 Sourigna Phetsarath :

> Who's responsible for merging the PRs?  What's the usual timeline for
> feedback and/or merging?
>
> Thank you.
>
> On Thu, Apr 21, 2016 at 6:09 PM, Flavio Pompermaier 
> wrote:
>
> > We just issued a PR about FLINK-1827 (
> > https://github.com/apache/flink/pull/1915) that improves test stability
> > except for the ml library, which still has some problems to solve.
> > On 21 Apr 2016 23:59, "Fabian Hueske"  wrote:
> >
> > > Hi Sourigna,
> > >
> > > thanks for contributing!
> > >
> > > Unrelated test failures are not a problem. We have some issues with
> build
> > > stability on Travis.
> > > We know which tests are failing from time to time and run the tests
> again
> > > before merging.
> > >
> > > Thanks
> > >
> > > 2016-04-21 23:18 GMT+02:00 Sourigna Phetsarath <
> > gna.phetsar...@teamaol.com
> > > >:
> > >
> > > > All:
> > > >
> > > > I created this pull request and it failed on the flink-ml module
> that I
> > > > didn't touch.  What are next steps?
> > > >
> > > > https://github.com/apache/flink/pull/1920
> > > >
> > > > Thank you for any help.
> > > >
> > > > --
> > > >
> > > >
> > > > *Gna Phetsarath* System Architect // AOL Platforms // Data Services //
> > > > Applied Research Chapter
> > > > 770 Broadway, 5th Floor, New York, NY 10003
> > > > o: 212.402.4871 // m: 917.373.7363
> > > > vvmr: 8890237 aim: sphetsarath20 t: @sourigna
> > > >
> > > >
> > >
> >
>
>
>
> --
>
>
> *Gna Phetsarath* System Architect // AOL Platforms // Data Services //
> Applied Research Chapter
> 770 Broadway, 5th Floor, New York, NY 10003
> o: 212.402.4871 // m: 917.373.7363
> vvmr: 8890237 aim: sphetsarath20 t: @sourigna
>
>


[jira] [Created] (FLINK-3810) Missing break in ZooKeeperLeaderElectionService#handleStateChange()

2016-04-25 Thread Ted Yu (JIRA)
Ted Yu created FLINK-3810:
-

 Summary: Missing break in ZooKeeperLeaderElectionService#handleStateChange()
 Key: FLINK-3810
 URL: https://issues.apache.org/jira/browse/FLINK-3810
 Project: Flink
  Issue Type: Bug
Reporter: Ted Yu


{code}
  protected void handleStateChange(ConnectionState newState) {
    switch (newState) {
      case CONNECTED:
        LOG.debug("Connected to ZooKeeper quorum. Leader election can start.");
      case SUSPENDED:
        LOG.warn("Connection to ZooKeeper suspended. The contender " + leaderContender.getAddress()
          + "no longer participates in the leader election.");
      case RECONNECTED:
        LOG.info("Connection to ZooKeeper was reconnected. Leader election can be restarted.");
      case LOST:
        // Maybe we have to throw an exception here to terminate the JobManager
        LOG.warn("Connection to ZooKeeper lost. The contender " + leaderContender.getAddress()
          + "no longer participates in the leader election.");
{code}
Because the case statements fall through, any of the first three states
results in multiple log messages.
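
A self-contained demonstration of why the missing break matters (generic Java mirroring the switch shape above, not the Flink class itself — the counter stands in for the log calls):

```java
public class FallThroughDemo {
    enum ConnectionState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    // Count how many "log statements" execute for a given state. With
    // withBreak=false the switch falls through, as in the reported bug.
    static int messagesLogged(ConnectionState newState, boolean withBreak) {
        int logged = 0;
        switch (newState) {
            case CONNECTED:
                logged++;                 // "Connected to ZooKeeper quorum."
                if (withBreak) break;
            case SUSPENDED:
                logged++;                 // "Connection suspended."
                if (withBreak) break;
            case RECONNECTED:
                logged++;                 // "Connection reconnected."
                if (withBreak) break;
            case LOST:
                logged++;                 // "Connection lost."
        }
        return logged;
    }

    public static void main(String[] args) {
        // Buggy: a single CONNECTED event executes all four branches.
        System.out.println(messagesLogged(ConnectionState.CONNECTED, false)); // prints 4
        // Fixed: exactly one branch runs.
        System.out.println(messagesLogged(ConnectionState.CONNECTED, true));  // prints 1
    }
}
```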





[jira] [Created] (FLINK-3809) Missing break in ZooKeeperLeaderRetrievalService#handleStateChange()

2016-04-25 Thread Ted Yu (JIRA)
Ted Yu created FLINK-3809:
-

 Summary: Missing break in ZooKeeperLeaderRetrievalService#handleStateChange()
 Key: FLINK-3809
 URL: https://issues.apache.org/jira/browse/FLINK-3809
 Project: Flink
  Issue Type: Bug
Reporter: Ted Yu


{code}
  protected void handleStateChange(ConnectionState newState) {
    switch (newState) {
      case CONNECTED:
        LOG.debug("Connected to ZooKeeper quorum. Leader retrieval can start.");
      case SUSPENDED:
        LOG.warn("Connection to ZooKeeper suspended. Can no longer retrieve the leader from " +
          "ZooKeeper.");
      case RECONNECTED:
        LOG.info("Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.");
      case LOST:
        LOG.warn("Connection to ZooKeeper lost. Can no longer retrieve the leader from " +
          "ZooKeeper.");
    }
{code}
Because the case statements fall through, every state except LOST leads to
multiple log messages.





Re: Pull request failed Travis...what's next?

2016-04-25 Thread Sourigna Phetsarath
Who's responsible for merging the PRs?  What's the usual timeline for
feedback and/or merging?

Thank you.

On Thu, Apr 21, 2016 at 6:09 PM, Flavio Pompermaier 
wrote:

> We just issued a PR about FLINK-1827 (
> https://github.com/apache/flink/pull/1915) that improves test stability
> except for the ml library, which still has some problems to solve.
> On 21 Apr 2016 23:59, "Fabian Hueske"  wrote:
>
> > Hi Sourigna,
> >
> > thanks for contributing!
> >
> > Unrelated test failures are not a problem. We have some issues with build
> > stability on Travis.
> > We know which tests are failing from time to time and run the tests again
> > before merging.
> >
> > Thanks
> >
> > 2016-04-21 23:18 GMT+02:00 Sourigna Phetsarath <
> gna.phetsar...@teamaol.com
> > >:
> >
> > > All:
> > >
> > > I created this pull request and it failed on the flink-ml module that I
> > > didn't touch.  What are next steps?
> > >
> > > https://github.com/apache/flink/pull/1920
> > >
> > > Thank you for any help.
> > >
> > > --
> > >
> > >
> > > *Gna Phetsarath* System Architect // AOL Platforms // Data Services //
> > > Applied Research Chapter
> > > 770 Broadway, 5th Floor, New York, NY 10003
> > > o: 212.402.4871 // m: 917.373.7363
> > > vvmr: 8890237 aim: sphetsarath20 t: @sourigna
> > >
> > >
> >
>



-- 


*Gna Phetsarath* System Architect // AOL Platforms // Data Services //
Applied Research Chapter
770 Broadway, 5th Floor, New York, NY 10003
o: 212.402.4871 // m: 917.373.7363
vvmr: 8890237 aim: sphetsarath20 t: @sourigna



Re: Parallelizing ExecutionConfig.fromCollection

2016-04-25 Thread Till Rohrmann
Hi Greg,

I think we haven't discussed the opportunity for a parallelized collection
input format, yet. Thanks for bringing this up.

I think it should be possible to implement a generic parallel collection
input format. However, I have two questions here:

1. Is it really a problem for users that their job exceeds the akka frame
size limit when using the collection input format? The collection input
format should be used primarily for testing and, thus, the data should be
rather small.

2. Which message is causing the frame size problem? If it is the task
deployment descriptor, then a parallelized collection input format which
works with input splits can solve the problem. If the problem is rather the
`SubmitJob` message, then we have to tackle the problem differently. The
reason is that the input splits are only created on the `JobManager`.
Before that, the collection is simply written into the task config of the
data source `JobVertex`, because we don't know the number of subtasks yet. In
the latter case, which can also be caused by large closure objects, we
should send the job via the blob manager to the `JobManager` to solve the
problem.
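
The input-split variant would carve the collection into contiguous slices, each serialized into its own split so that a subtask only deserializes its slice. A plain-Java sketch of the slicing arithmetic (illustrative, not the actual Flink InputSplit API):

```java
import java.util.ArrayList;
import java.util.List;

public class CollectionSplitter {
    // Chop a list into numSplits contiguous ranges whose sizes differ by at
    // most one element; each range would back one input split.
    static <T> List<List<T>> split(List<T> data, int numSplits) {
        List<List<T>> splits = new ArrayList<>();
        int size = data.size();
        for (int i = 0; i < numSplits; i++) {
            int start = i * size / numSplits;
            int end = (i + 1) * size / numSplits;
            splits.add(new ArrayList<>(data.subList(start, end)));
        }
        return splits;
    }
}
```

With lazy split assignment, idle subtasks would pull these slices from the JobManager one at a time, so no single Akka message has to carry the whole collection.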

Cheers,
Till

On Mon, Apr 25, 2016 at 3:45 PM, Greg Hogan  wrote:

> Hi,
>
> CollectionInputFormat currently enforces a parallelism of 1 by implementing
> NonParallelInput and serializing the entire Collection. If my understanding
> is correct, this serialized InputFormat is often the cause of a new job
> exceeding the akka message size limit.
>
> As an alternative the Collection elements could be serialized into multiple
> InputSplits. Has this idea been considered and rejected?
>
> Thanks,
> Greg
>


Parallelizing ExecutionConfig.fromCollection

2016-04-25 Thread Greg Hogan
Hi,

CollectionInputFormat currently enforces a parallelism of 1 by implementing
NonParallelInput and serializing the entire Collection. If my understanding
is correct, this serialized InputFormat is often the cause of a new job
exceeding the akka message size limit.

As an alternative the Collection elements could be serialized into multiple
InputSplits. Has this idea been considered and rejected?

Thanks,
Greg


Re: [DISCUSS] Methods for translating Graphs

2016-04-25 Thread Fabian Hueske
Hi Greg,

sorry for the late reply.
I am not super familiar with Gelly, but the use cases you describe sound
quite common to me.
I had a (very) brief look at the PR and the changes seem to be rather
lightweight.
So, in my opinion this looks like a valuable addition.

Thanks, Fabian


2016-04-21 18:06 GMT+02:00 Greg Hogan :

> Vasia and I are looking for additional feedback on FLINK-3771. This ticket
> [0] and PR [1] provide methods for translating the type or value of graph
> labels, vertex values, and edge values. My use cases are provided in JIRA,
> but I think users will find many more.
>
> Translators compose well with graph generators and graph algorithms, which
> are restricted to emitting a wide type. Translators can also be used for
> processing and merging input data.
>
> The equivalent facility on DataSet is simply a MapFunction. Graphs have
> additional structure in having two DataSets, vertices and edges, as well as
> the label type binding the two.
>
> The PR provides three basic implementations of the Translator interface as
> well as Translate methods which operate on graphs, vertices, and edges.
>
> Greg
>
> [0] https://issues.apache.org/jira/browse/FLINK-3771
> [1] https://github.com/apache/flink/pull/1900/files
>
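
The Translator idea described above amounts to a typed map applied uniformly to labels, vertex values, or edge values. A hypothetical sketch (illustrative interface, not necessarily the FLINK-3771 API):

```java
import java.util.List;
import java.util.stream.Collectors;

public class TranslateDemo {
    // A Translator is essentially a MapFunction restricted to one value,
    // reusable across the vertex DataSet, the edge DataSet, and the labels.
    interface Translator<OLD, NEW> {
        NEW translate(OLD value);
    }

    // Apply one translator to a whole collection of labels/values.
    static <OLD, NEW> List<NEW> translateLabels(List<OLD> labels,
                                                Translator<OLD, NEW> t) {
        return labels.stream().map(t::translate).collect(Collectors.toList());
    }
}
```

Because the same translator is applied to both the vertex and edge DataSets, the label type stays consistent across the graph, which is what a plain per-DataSet MapFunction cannot guarantee by construction.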


Re: Sqoop-like module in Flink

2016-04-25 Thread Fabian Hueske
Hi Flavio,

sorry for not replying earlier.
I think there is definitely need to improve the JdbcInputFormat.
All your points wrt to the current JdbcInputFormat are valid and fixing
them would be a big improvement and highly welcome contribution, IMO.
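
The parallelism point boils down to reading disjoint key ranges, one per split. A hedged sketch of the range arithmetic in plain Java (illustrative names, not the actual JdbcInputFormat API; assumes a numeric, roughly uniform key column):

```java
import java.util.ArrayList;
import java.util.List;

public class IdRangeSplits {
    // Divide [minId, maxId] into numSplits contiguous, disjoint ranges.
    // Each subtask would then run a parameterized query such as:
    //   SELECT ... FROM t WHERE id BETWEEN ? AND ?   (start, end)
    static List<long[]> splits(long minId, long maxId, int numSplits) {
        List<long[]> ranges = new ArrayList<>();
        long span = maxId - minId + 1;
        for (int i = 0; i < numSplits; i++) {
            long start = minId + i * span / numSplits;
            long end = minId + (i + 1) * span / numSplits - 1;
            ranges.add(new long[]{start, end});
        }
        return ranges;
    }
}
```

Skewed keys would need a smarter boundary query (e.g. sampling percentiles), but even this naive scheme removes the NonParallelInput restriction.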

I am not so sure about adding a flink-sqoop module to Flink.
How much better/faster would flink-sqoop be compared to Apache Sqoop? With
YARN it is easy to use two frameworks side-by-side.
Maybe you can share a few details about your use case / environment and why
flink-sqoop would be a good addition.

Best, Fabian


2016-04-15 10:03 GMT+02:00 Stefano Bortoli :

> Hi Flavio,
>
> I think this can be very handy when you have to run Sqoop-like jobs but
> need to run the process with few resources. As with Cascading, Flink could
> do the heavy lifting and make the scan of large relational databases more
> robust. Of course, to make it work in the real world, the JDBC InputFormat
> must be improved. Besides parallelism, null values, and the related
> InputSplits, we need to find a way to properly map the Java types to the
> database types. Probably having a wrapper POJO implementing a
> cast/transformation policy passed as a parameter of the InputFormat could
> do. Another thing we
> need to take care of is the management of connections, which can be very
> costly if the database is particularly large.
>
>
> saluti,
> Stefano
>
>
>
> 2016-04-13 12:45 GMT+02:00 Flavio Pompermaier :
>
> > Hi to all,
> > we've recently migrated our sqoop[1] import process to a Flink job, using
> > an improved version of the Flink JDBC Input Format[2] that is able to
> > exploit the parallelism of the cluster (the current Flink version
> > implements NonParallelInput).
> >
> > We still need to improve the mapping of SQL types to Java ones (in the
> > addValue method, IMHO), but this could be the basis for a flink-sqoop
> > module that will incrementally cover Sqoop's functionality as requested.
> > Do you think that such a module could be of interest for Flink or not?
> >
> > [1] https://sqoop.apache.org/
> > [2]
> https://gist.github.com/fpompermaier/bcd704abc93b25b6744ac76ac17ed351
> >
> > Best,
> > Flavio
> >
>


Re: Eclipse Problems

2016-04-25 Thread Robert Metzger
Cool, thank you for working on this!

On Mon, Apr 25, 2016 at 1:37 PM, Matthias J. Sax  wrote:

> I can confirm that the SO answer works.
>
> I will add a note to the Eclipse setup guide at the web site.
>
> -Matthias
>
>
> On 04/25/2016 11:33 AM, Robert Metzger wrote:
> > It seems that the user resolved the issue on SO, right?
> >
> > On Mon, Apr 25, 2016 at 11:31 AM, Ufuk Celebi  wrote:
> >
> >> On Mon, Apr 25, 2016 at 12:14 AM, Matthias J. Sax 
> >> wrote:
> >>> What do you think about this?
> >>
> >> Hey Matthias!
> >>
> >> Thanks for bringing this up.
> >>
> >> I think it is very desirable to keep support for Eclipse. It's quite a
> >> high barrier for new contributors to enforce a specific IDE (although
> >> IntelliJ is gaining quite the user base I think :P).
> >>
> >> Do you have time to look into this?
> >>
> >> – Ufuk
> >>
> >
>
>


Re: Eclipse Problems

2016-04-25 Thread Matthias J. Sax
I can confirm that the SO answer works.

I will add a note to the Eclipse setup guide at the web site.

-Matthias


On 04/25/2016 11:33 AM, Robert Metzger wrote:
> It seems that the user resolved the issue on SO, right?
> 
> On Mon, Apr 25, 2016 at 11:31 AM, Ufuk Celebi  wrote:
> 
>> On Mon, Apr 25, 2016 at 12:14 AM, Matthias J. Sax 
>> wrote:
>>> What do you think about this?
>>
>> Hey Matthias!
>>
>> Thanks for bringing this up.
>>
>> I think it is very desirable to keep support for Eclipse. It's quite a
>> high barrier for new contributors to enforce a specific IDE (although
>> IntelliJ is gaining quite the user base I think :P).
>>
>> Do you have time to look into this?
>>
>> – Ufuk
>>
> 



signature.asc
Description: OpenPGP digital signature


Re: Partition problem

2016-04-25 Thread Fabian Hueske
Hi Andrew,

I might be wrong, but I think this problem is caused by an assumption of
how Flink reads input data.
In Flink, each InputSplit is not read by a new task and a split does not
correspond to a partition. This is different from how Hadoop MR and Spark
handle InputSplits.

Instead, Flink creates as many DataSource tasks as specified by the task
parallelism and lazily assigns InputSplits to its subtasks. Idle DataSource
subtasks request InputSplits from the JobManager, and the assignment happens
first-come, first-served.
Hence, the subtask ID (or partition ID) of an InputSplit is not
deterministic, and a DataSource subtask might read more than one split, or
no split at all (as in your case).

If you need the split ID in your program, you can implement an InputFormat
which wraps another InputFormat and assigns the ID of the current InputSplit
to the read data, i.e., converts the data type from T to Tuple2[Int, T].

Hope this helps,
Fabian


2016-04-25 11:27 GMT+02:00 Till Rohrmann :

> Hi Andrew,
>
> I think the problem is that you assume that both matrices have the same
> partitioning. If you guarantee that this is the case, then you can use the
> subtask index as the block index. But in the general case this is not true,
> and then you have to calculate the blocks by first assigning a block index
> (e.g. rows with 0-9 index are assigned to block 0, rows with 10-19 assigned
> to block 1, etc.) and then create the blocks by reducing on this block
> index. That's because the distribution of the individual rows in the
> cluster is not necessarily the same between two matrices.
>
> Cheers,
> Till
>
> On Mon, Apr 25, 2016 at 1:40 AM, Andrew Palumbo 
> wrote:
>
> > Hi All,
> >
> >
> > I've run into a problem with empty partitions when the number of elements
> > in a DataSet is less than the Degree of Parallelism.  I've created a gist
> > here to describe it:
> >
> >
> > https://gist.github.com/andrewpalumbo/1768dac6d2f5fdf963abacabd859aaf3
> >
> >
> > I have two 2x2 matrices, Matrix A and Matrix B and an execution
> > environment where the degree of parallelism is 4. Both matrices are
> > blockified into 2 different DataSets. In this case (the case of 2x2
> > matrices with 4 partitions) this means that each row goes into a
> partition
> > leaving 2 empty partitions. In Matrix A, the rows go into partitions 0,
> 1.
> > However the rows of Matrix B end up in partitions 1, 2. I assign the
> > ordinal index of the blockified matrix's partition to its block, and then
> > join on that index.
> >
> >
> > However in this case, with differently partitioned matrices of the same
> > geometry, the intersection of the blockified matrices' indices is 1, and
> > partitions 0 and 2 are dropped.
> >
> >
> > I've tried explicitly defining the dop for Matrix B using the count of
> > non-empty partitions in Matrix A, however this changes the order of the
> > DataSet, placing partition 2 into partition 0.
> >
> >
> > Is there a way to make sure that these datasets are partitioned in the
> > same way?
> >
> >
> > Thank you,
> >
> >
> > Andy
> >
> >
> >
>


Re: [DISCUSS] Release Flink 1.0.3

2016-04-25 Thread Ufuk Celebi
+1 from my side, Robert. It's not changing anything for people who
don't configure it. We also did it for 1.0.2 with the DataSetUtils.

On Mon, Apr 25, 2016 at 11:06 AM, Robert Metzger  wrote:
> I'm currently working on the Flink+Bigtop integration and I found this
> issue quite annoying: https://issues.apache.org/jira/browse/FLINK-3678
>
> In my opinion, it's a bugfix we can include in the release.
> Any objections?
>
> On Mon, Apr 25, 2016 at 10:48 AM, Ufuk Celebi  wrote:
>
>> Hey all,
>>
>> thanks to Robert for starting the discussion.
>>
>> - FLINK-3790 is merged
>> - FLINK-3800 looks good to merge after a test issue is resolved
>> - FLINK-3701 needs some feedback on the latest changes
>>
>> After we have resolved these, I can kick off the first RC for 1.0.3.
>>
>> Are there any other fixes we want in after the mentioned issues have
>> been resolved?
>>
>> – Ufuk
>>
>>
>> On Fri, Apr 22, 2016 at 11:17 AM, Aljoscha Krettek 
>> wrote:
>> > https://issues.apache.org/jira/browse/FLINK-3701 does not affect the
>> 1.0.x
>> > series of releases, I think we also discussed this in the 1.0.2 release
>> > thread.
>> >
>> > On Fri, 22 Apr 2016 at 10:39 Kostas Kloudas > >
>> > wrote:
>> >
>> >> I am working on:
>> >> https://issues.apache.org/jira/browse/FLINK-2314?filter=-1 <
>> >> https://issues.apache.org/jira/browse/FLINK-2314?filter=-1>
>> >> which I believe will also affect:
>> >> https://issues.apache.org/jira/browse/FLINK-3796
>> >>
>> >> Essentially the FileSourceFunction will become obsolete.
>> >>
>> >> > On Apr 22, 2016, at 10:21 AM, Maximilian Michels 
>> wrote:
>> >> >
>> >> > Hi Robert,
>> >> >
>> >> > Thanks for starting a new thread for Flink 1.0.3.
>> >> >
>> >> > FLINK-3701 can be removed. It only affects the upcoming 1.1.0
>> >> > release. I would like to add
>> >> > https://issues.apache.org/jira/browse/FLINK-3796
>> >> >
>> >> > Cheers,
>> >> > Max
>> >> >
>> >> > On Thu, Apr 21, 2016 at 7:04 PM, Robert Metzger 
>> >> wrote:
>> >> >> Hi,
>> >> >>
>> >> >> in the 1.0.2 release thread we started already collecting issues for
>> the
>> >> >> next bugfix release.
>> >> >>
>> >> >> I think it would be helpful to keep the two discussions separate and
>> >> start
>> >> >> a new thread for the 1.0.3 rel.
>> >> >>
>> >> >> Fixes to be included:
>> >> >> https://issues.apache.org/jira/browse/FLINK-3790
>> >> >> https://issues.apache.org/jira/browse/FLINK-3800
>> >> >> https://issues.apache.org/jira/browse/FLINK-3701
>> >> >>
>> >> >> Please add more JIRAs to the list if you have some.
>> >>
>> >>
>>


Re: Eclipse Problems

2016-04-25 Thread Robert Metzger
It seems that the user resolved the issue on SO, right?

On Mon, Apr 25, 2016 at 11:31 AM, Ufuk Celebi  wrote:

> On Mon, Apr 25, 2016 at 12:14 AM, Matthias J. Sax 
> wrote:
> > What do you think about this?
>
> Hey Matthias!
>
> Thanks for bringing this up.
>
> I think it is very desirable to keep support for Eclipse. It's quite a
> high barrier for new contributors to enforce a specific IDE (although
> IntelliJ is gaining quite the user base I think :P).
>
> Do you have time to look into this?
>
> – Ufuk
>


Re: Eclipse Problems

2016-04-25 Thread Ufuk Celebi
On Mon, Apr 25, 2016 at 12:14 AM, Matthias J. Sax  wrote:
> What do you think about this?

Hey Matthias!

Thanks for bringing this up.

I think it is very desirable to keep support for Eclipse. It's quite a
high barrier for new contributors to enforce a specific IDE (although
IntelliJ is gaining quite the user base I think :P).

Do you have time to look into this?

– Ufuk


Re: Partition problem

2016-04-25 Thread Till Rohrmann
Hi Andrew,

I think the problem is that you assume that both matrices have the same
partitioning. If you guarantee that this is the case, then you can use the
subtask index as the block index. But in the general case this is not true,
and then you have to calculate the blocks by first assigning a block index
(e.g. rows with 0-9 index are assigned to block 0, rows with 10-19 assigned
to block 1, etc.) and then create the blocks by reducing on this block
index. That's because the distribution of the individual rows in the
cluster is not necessarily the same between two matrices.
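
The deterministic block assignment described above reduces to grouping rows by `rowIndex / blockSize`, independent of where each row physically lives. A sketch of that reduction key (illustrative plain Java, not Flink or Mahout code):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class BlockAssignment {
    // Assign each row index to a block: rows 0..blockSize-1 -> block 0,
    // blockSize..2*blockSize-1 -> block 1, etc. A reduce/groupReduce keyed
    // on this value then builds identical blockings for both matrices.
    static Map<Integer, List<Integer>> toBlocks(int numRows, int blockSize) {
        return IntStream.range(0, numRows).boxed()
            .collect(Collectors.groupingBy(row -> row / blockSize));
    }
}
```

Because both matrices are keyed by the same function of the row index, their blocks line up for the join regardless of how the rows were initially partitioned.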

Cheers,
Till

On Mon, Apr 25, 2016 at 1:40 AM, Andrew Palumbo  wrote:

> Hi All,
>
>
> I've run into a problem with empty partitions when the number of elements
> in a DataSet is less than the Degree of Parallelism.  I've created a gist
> here to describe it:
>
>
> https://gist.github.com/andrewpalumbo/1768dac6d2f5fdf963abacabd859aaf3
>
>
> I have two 2x2 matrices, Matrix A and Matrix B and an execution
> environment where the degree of parallelism is 4. Both matrices are
> blockified into 2 different DataSets. In this case (the case of 2x2
> matrices with 4 partitions) this means that each row goes into a partition
> leaving 2 empty partitions. In Matrix A, the rows go into partitions 0, 1.
> However the rows of Matrix B end up in partitions 1, 2. I assign the
> ordinal index of the blockified matrix's partition to its block, and then
> join on that index.
>
>
> However in this case, with differently partitioned matrices of the same
> geometry, the intersection of the blockified matrices' indices is 1, and
> partitions 0 and 2 are dropped.
>
>
> I've tried explicitly defining the dop for Matrix B using the count of
> non-empty partitions in Matrix A, however this changes the order of the
> DataSet, placing partition 2 into partition 0.
>
>
> Is there a way to make sure that these datasets are partitioned in the
> same way?
>
>
> Thank you,
>
>
> Andy
>
>
>


Re: [DISCUSS] Release Flink 1.0.3

2016-04-25 Thread Ufuk Celebi
Hey all,

thanks to Robert for starting the discussion.

- FLINK-3790 is merged
- FLINK-3800 looks good to merge after a test issue is resolved
- FLINK-3701 needs some feedback on the latest changes

After we have resolved these, I can kick off the first RC for 1.0.3.

Are there any other fixes we want in after the mentioned issues have
been resolved?

– Ufuk


On Fri, Apr 22, 2016 at 11:17 AM, Aljoscha Krettek  wrote:
> https://issues.apache.org/jira/browse/FLINK-3701 does not affect the 1.0.x
> series of releases, I think we also discussed this in the 1.0.2 release
> thread.
>
> On Fri, 22 Apr 2016 at 10:39 Kostas Kloudas 
> wrote:
>
>> I am working on:
>> https://issues.apache.org/jira/browse/FLINK-2314?filter=-1 <
>> https://issues.apache.org/jira/browse/FLINK-2314?filter=-1>
>> which I believe will also affect:
>> https://issues.apache.org/jira/browse/FLINK-3796
>>
>> Essentially the FileSourceFunction will become obsolete.
>>
>> > On Apr 22, 2016, at 10:21 AM, Maximilian Michels  wrote:
>> >
>> > Hi Robert,
>> >
>> > Thanks for starting a new thread for Flink 1.0.3.
>> >
>> > FLINK-3701 can be removed. It only affects the upcoming 1.1.0
>> > release. I would like to add
>> > https://issues.apache.org/jira/browse/FLINK-3796
>> >
>> > Cheers,
>> > Max
>> >
>> > On Thu, Apr 21, 2016 at 7:04 PM, Robert Metzger 
>> wrote:
>> >> Hi,
>> >>
>> >> in the 1.0.2 release thread we started already collecting issues for the
>> >> next bugfix release.
>> >>
>> >> I think it would be helpful to keep the two discussions separate and
>> start
>> >> a new thread for the 1.0.3 rel.
>> >>
>> >> Fixes to be included:
>> >> https://issues.apache.org/jira/browse/FLINK-3790
>> >> https://issues.apache.org/jira/browse/FLINK-3800
>> >> https://issues.apache.org/jira/browse/FLINK-3701
>> >>
>> >> Please add more JIRAs to the list if you have some.
>>
>>