[jira] [Commented] (TINKERPOP-3133) Customize the file count by repartition the OutputRDD in Spark to reduce HDFS small files

ASF GitHub Bot (Jira) Wed, 12 Feb 2025 06:51:24 -0800


    [ 
https://issues.apache.org/jira/browse/TINKERPOP-3133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17926406#comment-17926406
 ]


ASF GitHub Bot commented on TINKERPOP-3133:
-------------------------------------------

ministat commented on code in PR #3026:
URL: https://github.com/apache/tinkerpop/pull/3026#discussion_r1952802146


##########
spark-gremlin/src/main/java/org/apache/tinkerpop/gremlin/spark/structure/io/OutputFormatRDD.java:
##########
@@ -75,4 +77,17 @@ public <K, V> Iterator<KeyValue<K, V>> writeMemoryRDD(final Configuration config
         }
         return Collections.emptyIterator();
     }
-}
\ No newline at end of file
+
+    /**
+     * Allow users to customize the RDD partitions to reduce HDFS small files
+     */
+    private <K, V> JavaPairRDD<K, V> repartitionJavaPairRDD(final org.apache.hadoop.conf.Configuration hadoopConfiguration, JavaPairRDD<K, V> graphRDD) {
+        JavaPairRDD<K, V> javaPairRDD = graphRDD;
+        final String repartitionString = hadoopConfiguration.get(Constants.GREMLIN_SPARK_OUTPUT_REPARTITION);

Review Comment:
   PersistedOutputRDD persists the RDD, which is slightly different from 
writing the RDD to HDFS and does not generate small files. For consistency, 
though, I applied the change there as well.
   Here is a summary from ChatGPT: 
   
   > In summary, persisting an RDD in Spark does not directly lead to the 
generation of small files in HDFS. The generation of small files in HDFS is 
more commonly associated with writing RDD data to HDFS using methods like 
saveAsTextFile, which can happen independently of persisting the RDD in memory 
or disk.
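
The hunk above is cut off right after the setting is read, so the PR's actual handling is not visible here. As a minimal sketch only, assuming the setting holds a positive integer partition count, one way to validate such a value before repartitioning could look like this. The class and method names (`RepartitionSetting`, `parse`) are hypothetical and not part of TinkerPop.

```java
import java.util.OptionalInt;

// Hypothetical helper, not taken from the PR: interprets a raw configuration
// string as an optional partition count.
final class RepartitionSetting {
    private RepartitionSetting() {}

    /**
     * Returns the partition count if raw is a positive integer,
     * or an empty OptionalInt if it is missing, blank, non-numeric, or <= 0.
     */
    static OptionalInt parse(final String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return OptionalInt.empty();
        }
        try {
            final int count = Integer.parseInt(raw.trim());
            return count > 0 ? OptionalInt.of(count) : OptionalInt.empty();
        } catch (NumberFormatException e) {
            // Assumption: an unparsable value means "leave the RDD as it is".
            return OptionalInt.empty();
        }
    }
}
```

With a result in hand, a caller could apply Spark's `JavaPairRDD.repartition(int)` only when a count is present and otherwise return the RDD unchanged.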





> Customize the file count by repartition the OutputRDD in Spark to reduce HDFS 
> small files
> -----------------------------------------------------------------------------------------
>
>                 Key: TINKERPOP-3133
>                 URL: https://issues.apache.org/jira/browse/TINKERPOP-3133
>             Project: TinkerPop
>          Issue Type: Improvement
>          Components: hadoop
>    Affects Versions: 3.7.3
>            Reporter: Redriver
>            Priority: Major
>
> The graph is exported to HDFS through OutputRDD, but we often see many 
> small files in our production environment. For example, there are more than 
> 50,000 files, each about 17 MB, which triggers HDFS small-file alerts. It 
> would be better to allow customizing the number of output files by 
> repartitioning the OutputRDD.
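
To put the numbers above in perspective, here is a rough sizing sketch. It assumes a target file size of 128 MB and evenly distributed data. Both are assumptions for illustration and are not stated in the issue. The class and method names are hypothetical.

```java
// Hypothetical sizing helper, not part of TinkerPop: estimates how many
// output files are needed to reach a target file size.
final class OutputFileSizing {
    private OutputFileSizing() {}

    /**
     * Estimates the file count for the same total volume at a target size,
     * using ceiling division and assuming data is spread evenly across files.
     */
    static long targetFileCount(final long currentFiles, final long avgFileMb, final long targetFileMb) {
        if (currentFiles < 0 || avgFileMb < 0 || targetFileMb <= 0) {
            throw new IllegalArgumentException("counts and sizes must be non-negative, target size positive");
        }
        // Total volume in MB; multiplyExact guards against silent overflow.
        final long totalMb = Math.multiplyExact(currentFiles, avgFileMb);
        return (totalMb + targetFileMb - 1) / targetFileMb;
    }
}
```

For the figures in the issue, `OutputFileSizing.targetFileCount(50_000, 17, 128)` works out to 850,000 MB / 128 MB, rounded up to 6,641 files instead of 50,000.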



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
