[jira] [Commented] (TINKERPOP-3133) Customize the file count by repartition the OutputRDD in Spark to reduce HDFS small files

ASF GitHub Bot (Jira) Thu, 13 Feb 2025 18:36:10 -0800


    [ 
https://issues.apache.org/jira/browse/TINKERPOP-3133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17927023#comment-17927023
 ]


ASF GitHub Bot commented on TINKERPOP-3133:
-------------------------------------------

ministat commented on PR #3026:
URL: https://github.com/apache/tinkerpop/pull/3026#issuecomment-2658130010

   > What's the impact of this change if the users do not explicitly configure 
the spark output partitioning? Does this change impact the default behaviour in 
any meaningful way?
   If this option is not explicitly configured, the number of output partitions 
is determined by the input dataset. If the input data contains many partitions 
under Spark's default partitioning policy, the output suffers from the HDFS 
small files problem. That is why I created this PR. Users who can tolerate the 
small files problem do not need to configure this option.
   > 
   > Also is this intended to be targeting the master branch or is it intended 
for 3.7-dev?
   My previous PR targeted 3.7-dev, so I followed the same approach here. 
Should I retarget this PR to master?
   > 
   > Could you also add a quick changelog entry and document the new 
configuration in 
https://github.com/apache/tinkerpop/blob/master/docs/src/reference/implementations-spark.asciidoc?
   Sure
   
   




> Customize the file count by repartition the OutputRDD in Spark to reduce HDFS 
> small files
> -----------------------------------------------------------------------------------------
>
>                 Key: TINKERPOP-3133
>                 URL: https://issues.apache.org/jira/browse/TINKERPOP-3133
>             Project: TinkerPop
>          Issue Type: Improvement
>          Components: hadoop
>    Affects Versions: 3.7.3
>            Reporter: Redriver
>            Priority: Major
>
> The graph is exported to HDFS through OutputRDD, but we often see many 
> small files in production environments. For example, there can be more than 
> 50,000 files of about 17 MB each, which triggers HDFS small files 
> alerts. It would therefore be better to allow customizing the number of 
> output files by repartitioning the OutputRDD.
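
For a rough sense of scale, the figures in the description above (50,000 files of about 17 MB each) can be turned into a target partition count. The sketch below is only an illustration: the helper name `target_partitions` and the 128 MB target file size (chosen to match the common HDFS default block size) are assumptions, not values taken from the issue or the PR.

```python
import math


def target_partitions(file_count: int, avg_file_mb: float,
                      target_file_mb: float = 128.0) -> int:
    """Estimate how many output partitions yield files near target_file_mb.

    target_file_mb defaults to 128 MB, an assumed target that mirrors the
    common HDFS default block size; it is not a value from the issue.
    """
    total_mb = file_count * avg_file_mb
    # Round up so no partition exceeds the target; always keep at least one.
    return max(1, math.ceil(total_mb / target_file_mb))


# Figures from the issue description: 50,000 files of about 17 MB each.
estimate = target_partitions(50_000, 17)
print(estimate)  # 6641
```

With these assumed numbers the estimate is 6641 partitions instead of more than 50,000. The right value in practice would depend on the cluster's block size and on how the output is consumed downstream.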



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
