jihoonson commented on issue #6989: Behavior of index_parallel with 
appendToExisting=false and no bucketIntervals in GranularitySpec is surprising
URL: 
https://github.com/apache/incubator-druid/issues/6989#issuecomment-461221023
 
 
   @glasser 
   
   > 1\. Where should tests of this new logic end up?  And how do I actually 
run parts of the Druid test suite? I'm not very experienced with Maven — I know 
how to run `mvn install` to run all of the tests for all of Druid, but not 
anything more specific. (I use IntelliJ if that helps.)
   
   I would recommend using IntelliJ or any other IDE you prefer. If you want to do it from the terminal, you can run `mvn test -Dtest=TestClass` for unit tests and `mvn verify -P integration-tests -Dit.test=TestClass` for integration tests.
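   For example (the class names below are hypothetical; substitute the test you are working on, and `-pl` with the module that contains it):
   
   ```shell
   # Run a single unit test class in one module:
   mvn test -Dtest=IndexTaskTest -pl indexing-service
   
   # Run a single test method within that class:
   mvn test -Dtest='IndexTaskTest#testSomething' -pl indexing-service
   
   # Run a single integration test class:
   mvn verify -P integration-tests -Dit.test=ITIndexerTest
   ```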
   
   > 2\. If it's OK to dynamically add locks one by one as the task runs, why 
do the local and hadoop indexing tasks do an initial scan to determine all the 
intervals at once? Do they need to do that scan for some other reason anyway?
   
   I think it's because of https://github.com/apache/incubator-druid/pull/4550. Those classes were written before that PR, when there was no concept of revoking locks. As a result, if two or more tasks acquired locks one by one dynamically, they could get stuck in the middle of ingestion; worse, they could deadlock by blocking each other. Now, however, higher-priority tasks can preempt lower-priority tasks.
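   The preemption idea can be sketched like this (a toy model for illustration only, not Druid's actual `TaskLockbox` API):
   
   ```java
   import java.util.HashMap;
   import java.util.Map;
   
   // Toy model of priority-based lock preemption: a higher-priority task
   // revokes the lock of a lower-priority holder instead of blocking on it,
   // which is what prevents the "stuck forever" / deadlock scenario.
   class ToyLockbox {
       // interval -> priority of the task currently holding the lock
       private final Map<String, Integer> locks = new HashMap<>();
   
       /** Returns true if the lock was granted, revoking a lower-priority holder if needed. */
       boolean tryLock(String interval, int priority) {
           Integer holder = locks.get(interval);
           if (holder == null || holder < priority) {
               locks.put(interval, priority); // grant, preempting any lower-priority holder
               return true;
           }
           return false; // an equal- or higher-priority task already holds this lock
       }
   }
   ```
   
   Before lock revocation existed, the `holder < priority` branch did not: a second task would simply wait, possibly forever, on a lock the first task never released.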
   
   > 3\. General batch ingestion/segment replacement question: if you're using 
batch ingestion (of any kind: Hadoop, native, local) with granularitySpec 
interval specified and appendToExisting false, to re-ingest to an interval that 
already contains data, but there is a time chunk of the data source's segment
granularity that has no row in your batch ingestion run, what will happen to 
the data in that time chunk? It seems to me that nothing will happen because I 
haven't seen anything that creates empty segments for a time chunk with no 
data, and so there's no segment to overshadow the old segment.  Is that 
expected?  Is there a good way to say "replace this interval of time with data 
from this batch job, including dropping segments from time chunks if there's 
nothing there"?  We're considering using batch ingestion with the ingestSegment 
firehose and filtering in order to retain only specific rarer kinds of data 
past a certain distance in the past, and it's possible to imagine that that 
data might be missing for an entire hour here and there.
   
   Good question. I don't think we currently support that kind of replacement, but it may be worth supporting.
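   Your reading of the overshadowing behavior matches mine. A toy model of it (illustration only, not Druid's actual timeline code):
   
   ```java
   import java.util.HashMap;
   import java.util.List;
   import java.util.Map;
   
   // Toy model of segment overshadowing. New segments replace old ones only
   // in the time chunks they cover; a chunk that produced no rows produces
   // no segment, so there is nothing to overshadow the old data there.
   class ToyTimeline {
       final Map<String, String> visible = new HashMap<>(); // time chunk -> visible segment version
   
       void publish(List<String> chunksWithData, String version) {
           for (String chunk : chunksWithData) {
               visible.put(chunk, version); // overshadow only where rows were produced
           }
           // Chunks absent from chunksWithData keep their old version.
       }
   }
   ```
   
   So a re-ingestion whose input has an empty hour leaves the old segment for that hour visible, exactly as you describe.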

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
