[ https://issues.apache.org/jira/browse/TINKERPOP-3133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17926866#comment-17926866 ]
ASF GitHub Bot commented on TINKERPOP-3133:
-------------------------------------------

Cole-Greer commented on PR #3026:
URL: https://github.com/apache/tinkerpop/pull/3026#issuecomment-2657237128

   What's the impact of this change if users do not explicitly configure the Spark output partitioning? Does this change impact the default behaviour in any meaningful way?

   Also, is this intended to target the master branch or 3.7-dev?

> Customize the file count by repartition the OutputRDD in Spark to reduce HDFS small files
> -----------------------------------------------------------------------------------------
>
>                 Key: TINKERPOP-3133
>                 URL: https://issues.apache.org/jira/browse/TINKERPOP-3133
>             Project: TinkerPop
>          Issue Type: Improvement
>          Components: hadoop
>    Affects Versions: 3.7.3
>            Reporter: Redriver
>            Priority: Major
>
> The graph is exported to HDFS through OutputRDD, but we often see many
> small files in production environments. For example, there are more than
> 50,000 files of about 17 MB each, which triggers HDFS small-file
> alerts. It would be better to allow customizing the number of output files
> by repartitioning the OutputRDD.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)