chaokunyang commented on issue #2015:
URL: https://github.com/apache/fury/issues/2015#issuecomment-2623620236

   Hi @jayhan94, we don't have such documents currently. A better Fury integration with Spark/Flink would require changing the source code of the serialization modules in Spark/Flink, which is beyond the scope of this project. Maybe in the future we can submit proposals to the Spark/Flink communities.
   
   Currently, if you want to use Fury in Spark/Flink, you can update your driver program to add several chained serialization/deserialization operators (narrow dependencies in Spark).
   
   Here is a simple Spark RDD example:
   ```scala
   val lines = sc.textFile("data.txt")
   // Parse each line into a Struct (Scala uses classOf[Struct], not Struct.class).
   val structSet = lines.map(s => Json.parse(s, classOf[Struct]))
   // Serialize with Fury before the shuffle, keyed by s.key.
   val kvset = structSet.map(s => (s.key, fury.serialize(s)))
   // Deserialize the first value per key after the shuffle.
   kvset.groupByKey()
     .map(t => (t._1, fury.deserialize(t._2.head)))
     .collect()
     .foreach(println)
   ```
   
   A Flink program would be similar:
   ```java
   DataStream<Struct> dataStream = xxxstream.map(s -> Json.parse(s, Struct.class));
   // Serialize with Fury before the network shuffle.
   DataStream<byte[]> byteStream = dataStream.map(s -> fury.serialize(s));
   byteStream.rebalance().map(bytes -> (Struct) fury.deserialize(bytes));
   ```
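   The pattern in both examples is the same: serialize records to `byte[]` before the shuffle boundary and deserialize them afterwards, so the engine only ever moves opaque byte arrays. Below is a minimal, self-contained sketch of that round trip. The `Struct` class, `toBytes`/`fromBytes` helpers, and the use of `java.io` serialization are illustrative stand-ins only, chosen so the sketch runs without Spark, Flink, or Fury on the classpath; in a real job you would replace `toBytes`/`fromBytes` with `fury.serialize`/`fury.deserialize` on a configured Fury instance.

   ```java
   import java.io.ByteArrayInputStream;
   import java.io.ByteArrayOutputStream;
   import java.io.IOException;
   import java.io.ObjectInputStream;
   import java.io.ObjectOutputStream;
   import java.io.Serializable;
   import java.util.ArrayList;
   import java.util.HashMap;
   import java.util.List;
   import java.util.Map;

   public class SerializeShuffleSketch {
       // Hypothetical record type standing in for the Struct used above.
       static class Struct implements Serializable {
           final String key;
           final int value;
           Struct(String key, int value) { this.key = key; this.value = value; }
       }

       // Stand-in for fury.serialize(s).
       static byte[] toBytes(Struct s) throws IOException {
           ByteArrayOutputStream bos = new ByteArrayOutputStream();
           try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
               out.writeObject(s);
           }
           return bos.toByteArray();
       }

       // Stand-in for (Struct) fury.deserialize(bytes).
       static Struct fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
           try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
               return (Struct) in.readObject();
           }
       }

       public static void main(String[] args) throws Exception {
           List<Struct> records =
               List.of(new Struct("a", 1), new Struct("a", 2), new Struct("b", 3));

           // "Map side": key each record by s.key and ship the payload as bytes,
           // mirroring the (s.key, fury.serialize(s)) step in the Spark example.
           Map<String, List<byte[]>> grouped = new HashMap<>();
           for (Struct s : records) {
               grouped.computeIfAbsent(s.key, k -> new ArrayList<>()).add(toBytes(s));
           }

           // "Reduce side": deserialize the first payload per key, mirroring
           // the groupByKey().map(...) step.
           for (Map.Entry<String, List<byte[]>> e : grouped.entrySet()) {
               Struct first = fromBytes(e.getValue().get(0));
               System.out.println(e.getKey() + " -> " + first.value);
           }
       }
   }
   ```

   Because the byte arrays are opaque to the engine, no `Struct` class needs to be known to the framework's own serializer between the two map operators, which is what makes this work without touching Spark/Flink internals.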


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
