glasser commented on a change in pull request #7048: Make
IngestSegmentFirehoseFactory splittable for parallel ingestion
URL: https://github.com/apache/incubator-druid/pull/7048#discussion_r268355557
##########
File path:
indexing-service/src/main/java/org/apache/druid/indexing/firehose/IngestSegmentFirehoseFactory.java
##########
@@ -52,28 +56,38 @@
import org.apache.druid.timeline.TimelineObjectHolder;
import org.apache.druid.timeline.VersionedIntervalTimeline;
import org.apache.druid.timeline.partition.PartitionChunk;
+import org.apache.druid.timeline.partition.PartitionHolder;
import org.joda.time.Duration;
import org.joda.time.Interval;
import javax.annotation.Nullable;
import java.io.File;
import java.io.IOException;
+import java.util.ArrayList;
import java.util.Collections;
+import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
+import java.util.SortedMap;
+import java.util.TreeMap;
import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
+import java.util.stream.Stream;
-public class IngestSegmentFirehoseFactory implements FirehoseFactory<InputRowParser>
+public class IngestSegmentFirehoseFactory implements FiniteFirehoseFactory<InputRowParser, List<WindowedSegmentId>>
{
private static final EmittingLogger log = new EmittingLogger(IngestSegmentFirehoseFactory.class);
+ private static final long DEFAULT_MAX_INPUT_SEGMENT_BYTES_PER_TASK = 150 * 1024 * 1024;
Review comment:
You know, I swear I did when I wrote this, but I can't remember now and I
clearly didn't write it down. Do you have any suggestions?
The default `maxRowsPerSegment` for Kafka indexing seems like a reasonable
place to start, but then one has to estimate how many bytes are in a typical
row and how many segments we'd like each task to produce. My default here is
probably too low?
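
To make that reasoning concrete, here is a rough back-of-the-envelope sketch, not a proposal for the final value. The class and method names are hypothetical, and all three inputs are assumptions: 5,000,000 rows per segment (which I believe is the Kafka indexing default, but please check the docs), 100 bytes per row, and one segment per task.

```java
public class SplitSizeEstimate
{
  // The default currently in this PR: 150 MiB.
  static final long CURRENT_DEFAULT_BYTES = 150L * 1024 * 1024;

  /**
   * Estimates a per-task input byte budget as
   * rowsPerSegment * bytesPerRow * segmentsPerTask.
   * Uses multiplyExact so an overflow fails loudly instead of wrapping.
   */
  static long estimateBytesPerTask(long rowsPerSegment, long bytesPerRow, long segmentsPerTask)
  {
    return Math.multiplyExact(Math.multiplyExact(rowsPerSegment, bytesPerRow), segmentsPerTask);
  }

  /** Number of rows of the given size that fit in a byte budget. */
  static long rowsCovered(long budgetBytes, long bytesPerRow)
  {
    return budgetBytes / bytesPerRow;
  }

  public static void main(String[] args)
  {
    long rowsPerSegment = 5_000_000L; // assumption: Kafka indexing default maxRowsPerSegment
    long bytesPerRow = 100L;          // assumption: illustrative row size only
    long segmentsPerTask = 1L;        // assumption: one output segment per task

    long estimate = estimateBytesPerTask(rowsPerSegment, bytesPerRow, segmentsPerTask);
    System.out.println("Estimated budget (bytes): " + estimate);
    System.out.println("Current default (bytes):  " + CURRENT_DEFAULT_BYTES);
    System.out.println("Rows covered by current default: " + rowsCovered(CURRENT_DEFAULT_BYTES, bytesPerRow));
  }
}
```

Under those assumed numbers the current default covers well under one segment's worth of rows per task, which is why I suspect it is too low. The answer depends heavily on the real bytes-per-row figure, which I don't have.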