[ https://issues.apache.org/jira/browse/SPARK-23074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16326455#comment-16326455 ]

Ruslan Dautkhanov edited comment on SPARK-23074 at 1/15/18 5:40 PM:
--------------------------------------------------------------------

{quote}You can create a DataFrame from the result of .zipWithIndex on an RDD, 
as you see here.
{quote}
That's very kludgy, as you can see from the code snippet above. Direct DataFrame-level functionality equivalent to this RDD-level API would be super awesome to have.
{quote}There's already a rowNumber function in Spark SQL, however, which sounds 
like the native equivalent?
{quote}
That's not the same. rowNumber requires an expression to `order by` on. What if there is no such column? For example, we often have files that we ingest into Spark where the physical position of a record determines how that record has to be processed.
 I can give an exact example if you're interested. zipWithIndex() is actually the only API call that preserves that information from the original source (even though the data can end up distributed across multiple partitions etc.).
 Also, as folks in that stackoverflow question said, the rowNumber approach is way slower (which is not surprising, since it requires sorting the data).
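
To illustrate the difference, here is a rough sketch (Scala, Spark 2.x, where rowNumber is now row_number; the DataFrame someDf and the column name are made up for illustration):
{code:java}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// row_number() only works over a window with an ORDER BY expression,
// so it needs some column that already encodes the desired order.
// An un-partitioned window like this also pulls all rows into a single
// partition, which is part of why it is slow.
val w = Window.orderBy("some_existing_column")              // hypothetical column
val numbered = someDf.withColumn("id", row_number().over(w))

// zipWithIndex() needs no such column: it numbers rows by their physical
// position in the RDD, across all partitions.
val indexed = someDf.rdd.zipWithIndex                       // RDD[(Row, Long)]
{code}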


> Dataframe-ified zipwithindex
> ----------------------------
>
>                 Key: SPARK-23074
>                 URL: https://issues.apache.org/jira/browse/SPARK-23074
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 2.3.0
>            Reporter: Ruslan Dautkhanov
>            Priority: Minor
>              Labels: dataframe, rdd
>
> Would be great to have a dataframe-friendly equivalent of rdd.zipWithIndex():
> {code:java}
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.types.{LongType, StructField, StructType}
> import org.apache.spark.sql.Row
> def dfZipWithIndex(
>   df: DataFrame,
>   offset: Int = 1,
>   colName: String = "id",
>   inFront: Boolean = true
> ) : DataFrame = {
>   df.sqlContext.createDataFrame(
>     df.rdd.zipWithIndex.map(ln =>
>       Row.fromSeq(
>         (if (inFront) Seq(ln._2 + offset) else Seq())
>           ++ ln._1.toSeq ++
>         (if (inFront) Seq() else Seq(ln._2 + offset))
>       )
>     ),
>     StructType(
>       (if (inFront) Array(StructField(colName,LongType,false)) else Array[StructField]())
>         ++ df.schema.fields ++
>       (if (inFront) Array[StructField]() else Array(StructField(colName,LongType,false)))
>     )
>   ) 
> }
> {code}
> credits: 
> [https://stackoverflow.com/questions/30304810/dataframe-ified-zipwithindex]
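>
> For reference, a quick usage sketch (e.g. in spark-shell; the input DataFrame here is just an example):
> {code:java}
> val df = spark.range(5).toDF("value")          // any existing DataFrame
> val indexed = dfZipWithIndex(df)               // prepends an "id" column starting at 1
> val appended = dfZipWithIndex(df, offset = 0, colName = "row_idx", inFront = false)
> {code}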


