adsharma commented on issue #679: URL: https://github.com/apache/incubator-graphar/issues/679#issuecomment-3320914564
Thank you for explaining! My solution is motivated by trying to serve Wikidata (90 million nodes, 800+ million edges) from Kuzu. The on-disk storage requirements were unacceptable due to denormalization, and I'm looking to serve graphs 10x that size, so on-disk storage with selective loading is the main use case. I also want to compete with LLMs on graph compression and storage efficiency by offloading some of the knowledge stored in them into external storage. Parquet files as they stand aren't sufficient, but they are a step in the right direction.

I don't want to mandate whether the edges are sorted by type or by graph structure; that depends on the use case, and I want to support both well. The Kuzu folks decided to support strongly typed nodes and edges, but you can always store a weakly typed graph by merging everything into a single "node" table and a single "rel" table (a schema sketch follows below).

If the Parquet files are sorted, readers can do predicate pushdown; DuckDB and Spark both support it. DuckDB's native storage is also supported as an additional single-file option. Why? It has a few more [compression tricks](https://duckdb.org/2025/09/08/duckdb-on-the-framework-laptop-13.html#tpc-h-sf10000), and a single file is more convenient: in that TPC-H SF10000 example, the Parquet files were 4 TB while the DuckDB file was 2.7 TB. Kuzu has an extension to read from DuckDB, but I'm not sure it can handle TB-sized files or do efficient predicate pushdown.
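For concreteness, here is a minimal sketch of the weakly typed layout using Kuzu's Python API. The database path, table names, and properties are hypothetical, just to illustrate collapsing all labels into one "node" table and one "rel" table:

```python
import kuzu

# Hypothetical database path, for illustration only.
db = kuzu.Database("wikidata_kuzu")
conn = kuzu.Connection(db)

# One generic node table and one generic rel table instead of a table per
# strongly typed label; per-label properties would be folded into generic
# columns (hypothetical schema).
conn.execute(
    "CREATE NODE TABLE node(id INT64, label STRING, PRIMARY KEY (id))"
)
conn.execute(
    "CREATE REL TABLE rel(FROM node TO node, type STRING)"
)
```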
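And a sketch of the predicate-pushdown and single-file points with DuckDB's Python API, assuming edge files sorted by source id (the file layout and column names are hypothetical):

```python
import duckdb

con = duckdb.connect()

# If the edge files are sorted by source id, DuckDB's row-group zone maps
# (min/max statistics) let it skip most of the data when a predicate on
# `src` is pushed into the Parquet scan.
rows = con.execute(
    "SELECT dst FROM read_parquet('edges/*.parquet') WHERE src = 42"
).fetchall()

# EXPLAIN shows the filter applied inside the Parquet scan operator rather
# than in a separate filter above it.
print(con.execute(
    "EXPLAIN SELECT dst FROM read_parquet('edges/*.parquet') WHERE src = 42"
).fetchall())

# The same data rewritten as a single-file native DuckDB database, which can
# apply additional lightweight compression schemes beyond what Parquet offers.
file_con = duckdb.connect("edges.duckdb")
file_con.execute(
    "CREATE TABLE edges AS SELECT * FROM read_parquet('edges/*.parquet')"
)
```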