RE: Performance improvements for sorted RDDs

2016-03-21 Thread JOAQUIN GUANTER GONZALBEZ
Hi Daniel,

I am glad you already ran the numbers on this change ☺ (for anyone reading, 
they can be found on slide 19 in 
http://www.slideshare.net/SparkSummit/interactive-graph-analytics-daniel-darabos
 ). I haven’t done any formal benchmarking, but the speedup in our jobs is 
highly noticeable.

I agree it can be done without modifying Spark (we have our own 
implementation in our codebase as well), but it seems a pity that anyone using 
the RDD API won’t get the benefit of having a sorted RDD (which happens quite 
often, since the shuffle phase can sort!).

Ximo.

From: Daniel Darabos [mailto:daniel.dara...@lynxanalytics.com]
Sent: Monday, March 21, 2016 16:20
To: Ted Yu
CC: JOAQUIN GUANTER GONZALBEZ; dev@spark.apache.org
Subject: Re: Performance improvements for sorted RDDs

There is related discussion in 
https://issues.apache.org/jira/browse/SPARK-8836. It's not too hard to 
implement this without modifying Spark and we measured ~10x improvement over 
plain RDD joins. I haven't benchmarked against DataFrames -- maybe they also 
realize this performance advantage.

On Mon, Mar 21, 2016 at 11:41 AM, Ted Yu <yuzhih...@gmail.com> wrote:
Do you have performance numbers to back up this proposal for the cogroup operation?

Thanks

On Mon, Mar 21, 2016 at 1:06 AM, JOAQUIN GUANTER GONZALBEZ 
<joaquin.guantergonzal...@telefonica.com> wrote:
Hello devs,

I have found myself in a situation where Spark is doing sub-optimal 
computations for my RDDs, and I was wondering whether a patch to enable 
improved performance for this scenario would be a welcome addition to Spark or 
not.

The scenario happens when trying to cogroup two RDDs that are sorted by key and 
share the same partitioner. CoGroupedRDD will correctly detect that the RDDs 
have the same partitioner and will therefore create narrow cogroup split 
dependencies, as opposed to shuffle dependencies. This is great because it 
prevents any shuffling from happening. However, the cogroup is unable to detect 
that the RDDs are sorted in the same way, and will still insert all elements of 
each RDD into a map in order to join the elements that share a key.

When both RDDs are sorted using the same ordering, the cogroup can join by 
doing a single pass over the data (since the data is ordered by key, you can 
keep iterating until you find a different key). This would greatly reduce the 
memory requirements for this kind of operation.
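
To illustrate the single-pass idea, here is a minimal sketch of a sort-merge 
cogroup over two already-sorted per-partition iterators, in plain Scala with no 
Spark dependency (all names are illustrative, not actual Spark code):

```scala
object SortedCogroup {
  // One linear pass over two iterators that are each sorted by key:
  // the next output key is the smaller of the two head keys, and all
  // consecutive values for that key are drained from both sides.
  def cogroupSorted[K: Ordering, A, B](
      left: Iterator[(K, A)],
      right: Iterator[(K, B)]): Iterator[(K, (Seq[A], Seq[B]))] = {
    val ord = implicitly[Ordering[K]]
    val l = left.buffered
    val r = right.buffered

    new Iterator[(K, (Seq[A], Seq[B]))] {
      def hasNext: Boolean = l.hasNext || r.hasNext

      // Collect every value for `key` from one buffered iterator.
      private def drain[V](it: BufferedIterator[(K, V)], key: K): Seq[V] = {
        val buf = scala.collection.mutable.ArrayBuffer.empty[V]
        while (it.hasNext && ord.equiv(it.head._1, key)) buf += it.next()._2
        buf.toSeq
      }

      def next(): (K, (Seq[A], Seq[B])) = {
        val key =
          if (!l.hasNext) r.head._1
          else if (!r.hasNext) l.head._1
          else ord.min(l.head._1, r.head._1)
        (key, (drain(l, key), drain(r, key)))
      }
    }
  }
}
```

Only the values for the current key are buffered at any moment, instead of a 
map holding an entire partition, which is where the memory saving comes from.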

Adding this to Spark would require adding an “ordering” member to RDD of type 
Option[Ordering], similar to how the “partitioner” field works. That way, the 
sorting operations could populate this field, and the operations that could 
benefit from this knowledge (cogroup, join, groupByKey, etc.) could read it to 
change their behavior accordingly.
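
To make the proposed shape concrete, something like the following (an 
illustrative sketch only; the class and method names are stand-ins, not actual 
Spark code):

```scala
object OrderingProposal {
  // Stand-in for Spark's RDD, reduced to the two metadata fields
  // relevant to this proposal.
  abstract class RDD[T] {
    // Existing field: how (and whether) elements map to partitions.
    val partitioner: Option[AnyRef] = None
    // Proposed field: Some(ord) if each partition is sorted by `ord`.
    def ordering: Option[Ordering[_]] = None
  }

  // Sorting operations would return an RDD with the field populated.
  class SortedRDD[K, V](ord: Ordering[K]) extends RDD[(K, V)] {
    override def ordering: Option[Ordering[K]] = Some(ord)
  }

  // cogroup/join/groupByKey could then check both fields before
  // falling back to the map-based implementation.
  def canMergeJoin[T](a: RDD[T], b: RDD[T]): Boolean =
    a.ordering.isDefined && a.ordering == b.ordering &&
      a.partitioner.isDefined && a.partitioner == b.partitioner
}
```
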

Do you think this would be a good addition to Spark?

Thanks,
Ximo



The information contained in this transmission is privileged and confidential 
information intended only for the use of the individual or entity named above. 
If the reader of this message is not the intended recipient, you are hereby 
notified that any dissemination, distribution or copying of this communication 
is strictly prohibited. If you have received this transmission in error, do not 
read it. Please immediately reply to the sender that you have received this 
communication in error and then delete it.






