Iskander14yo opened a new issue, #917:
URL: https://github.com/apache/incubator-graphar/issues/917

   ### Describe the enhancement requested
   
   Today the conversion/import path does not seem to scale well for larger 
datasets.
   
   From reading the current code and trying it in practice, I see that:
   - the C++ high-level builders are convenience APIs and keep data in memory 
until `Dump()`
   - the Spark writer scales better in principle, but still does heavy batch 
work such as index generation, joins, sorting, repartitioning, and offset 
construction
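
   To make the batch steps above concrete, here is a minimal single-machine
   sketch of index generation, remapping plus sorting, and offset construction.
   It is an illustration only, not GraphAr's Spark writer: the function names
   (`build_index`, `remap_and_sort`, `build_offsets`) and the sample data are
   hypothetical, and a CSR-like offset layout is assumed.

```python
def build_index(vertex_ids):
    """Index generation: map each original vertex id to a dense internal index."""
    return {vid: idx for idx, vid in enumerate(sorted(set(vertex_ids)))}


def remap_and_sort(edges, index):
    """Join + sort: replace original ids with internal indices, then sort by (src, dst)."""
    return sorted((index[src], index[dst]) for src, dst in edges)


def build_offsets(sorted_edges, num_vertices):
    """Offset construction (assumed CSR-like): offsets[v] is the position of the
    first edge whose source is v; offsets[-1] is the total number of edges."""
    counts = [0] * num_vertices
    for src, _ in sorted_edges:
        counts[src] += 1
    offsets = [0]
    for count in counts:
        offsets.append(offsets[-1] + count)
    return offsets


# Hypothetical sample data, not taken from any GraphAr dataset.
vertices = ["carol", "alice", "bob"]
edges = [("carol", "alice"), ("alice", "bob"), ("alice", "carol")]

index = build_index(vertices)
sorted_edges = remap_and_sort(edges, index)
offsets = build_offsets(sorted_edges, len(index))
```

   Each step here needs the whole vertex or edge set at once, which is why the
   distributed equivalents (joins, sorts, repartitioning) dominate the cost on
   large inputs.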
   
   Because GraphAr is positioned for use with "large-scale graph data", it 
would be useful for the community to have a clearer path for scalable conversion.
   
   Assuming I'm not missing something, my suggestion is:
     - keep the C++ high-level writer/builder path simple/reference-oriented 
and convenient for small/medium imports
     - optimize the Spark API/writer as the primary path for large-scale 
conversion
   
   This way, we treat Spark as the practical, scalable backend for data lakes, 
object stores, HDFS, and distributed preprocessing.
   
   Why Spark seems like the better place to optimize first:
   - Spark is a first-class citizen in data lake environments and is used in 
production by many organizations, so in practice it is more accessible to end 
users than a dedicated VM used only for a C++ import
   - storage backends such as S3/HDFS are abstracted through Spark/Hadoop
   - large joins / remapping / repartitioning are natural Spark workloads
   - avoiding two separate “fully optimized” implementations (Spark and C++) 
may make the project easier to maintain long-term
   
   To sum up, would the project agree with this direction?
   
   If it sounds reasonable, I’d be happy to help investigate and propose 
concrete improvements in the Spark conversion path.
   
   ### Component(s)
   
   C++, Spark


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

