[ https://issues.apache.org/jira/browse/METRON-672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15838867#comment-15838867 ]

ASF GitHub Bot commented on METRON-672:
---------------------------------------

GitHub user cestella opened a pull request:

    https://github.com/apache/incubator-metron/pull/424

    METRON-672: SolrIndexingIntegrationTest fails intermittently

    This failure is due to a change in default behavior when indexing was split 
off into a separate configuration file.  In particular, the default batch size 
was changed from `5` to `1`.  This, by itself, is not a problem, but the 
`IndexingIntegrationTest` (the base class for the Solr and Elasticsearch 
integration tests):
    * submits the configs
    * starts the indexing topology
    * writes the input data
    
    The writing of the input data may happen before the topology or the 
configuration has fully loaded, especially if the machine running the unit 
tests is under load (as with Travis).  As a result, the first record may be 
handled with the default batch size of 1 and be written out immediately, 
because the indexing configs haven't been loaded into ZooKeeper just yet.  
Eventually the configs do load and the batch size is set to `5`.  Meanwhile, 
we've written 10 records and are expecting 10 in return, but because the first 
record was written out on its own and the next 5 as a full batch, another 4 are 
left pending in the `BulkMessageWriterBolt`.
    
    So, the failure scenario is as follows (a small simulation is sketched 
after this list):
    * Message 1 is received before the indexing config has loaded, so the 
batch size is 1 and it is immediately written out
    * Messages 2 - 5 are each received after the indexing config has loaded, so 
the batch size is 5 and they queue up
    * Message 6 is received, completing the batch of 5, which is written out
    * Messages 7 - 10 are received but never make a full batch, so we time out 
waiting for them to be written out
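    
    For illustration, here is a minimal stand-alone sketch (not Metron code; 
the class and variable names are made up) that reproduces the arithmetic above: 
the first message flushes alone under the default batch size of 1, messages 
2 - 6 flush as one batch of 5, and messages 7 - 10 are left pending.
    
    ```java
    // Toy simulation of a batch writer whose batch size starts at the default
    // of 1 and is bumped to 5 once the indexing config "loads".
    import java.util.ArrayList;
    import java.util.List;
    
    public class BatchRaceSimulation {
      public static void main(String[] args) {
        int batchSize = 1;                  // default before the config loads
        List<Integer> queue = new ArrayList<>();
        int written = 0;
        for (int msg = 1; msg <= 10; msg++) {
          if (msg == 2) {
            batchSize = 5;                  // config loads after the first message
          }
          queue.add(msg);
          if (queue.size() >= batchSize) {
            written += queue.size();        // flush the batch
            queue.clear();
          }
        }
        // Prints "written=6, pending=4": messages 7 - 10 never fill a batch.
        System.out.println("written=" + written + ", pending=" + queue.size());
      }
    }
    ```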
    
    The fix is to ensure that we don't write out messages to Kafka until the 
configs are loaded, which is what this PR does.
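    
    The idea, sketched below, is to block the test's Kafka producer until the 
indexing config is visible in ZooKeeper.  This is only an illustrative sketch 
using Curator directly; the ZooKeeper path, class name, and timeout handling 
are assumptions, not the actual code in this PR.
    
    ```java
    // Hypothetical helper: poll ZooKeeper until the indexing config node exists,
    // then let the test start producing input messages to Kafka.
    import org.apache.curator.framework.CuratorFramework;
    
    public class ConfigWaiter {
      // Illustrative path; the real config location may differ.
      private static final String INDEXING_CONFIG_PATH = "/metron/topology/indexing/test";
    
      public static void waitForIndexingConfig(CuratorFramework client, long timeoutMs)
          throws Exception {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
          if (client.checkExists().forPath(INDEXING_CONFIG_PATH) != null) {
            return;                 // config is in ZooKeeper; safe to write to Kafka
          }
          Thread.sleep(500);        // back off and poll again
        }
        throw new IllegalStateException("Indexing config never appeared in ZooKeeper");
      }
    }
    ```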

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cestella/incubator-metron METRON-672

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-metron/pull/424.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #424
    
----

----


> SolrIndexingIntegrationTest fails intermittently
> ------------------------------------------------
>
>                 Key: METRON-672
>                 URL: https://issues.apache.org/jira/browse/METRON-672
>             Project: Metron
>          Issue Type: Bug
>    Affects Versions: 0.3.0
>            Reporter: Justin Leet
>            Assignee: Casey Stella
>
> Adapted from a dev list conversation
> h4. Initial Error in Travis
> Jon noted this in the Travis builds
> {code}
> `test(org.apache.metron.solr.integration.SolrIndexingIntegrationTest):
> Took too long to complete: 150582 > 150000`, more details below:
> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed:
> 166.167 sec <<< FAILURE!
> test(org.apache.metron.solr.integration.SolrIndexingIntegrationTest)
> Time elapsed: 166.071 sec  <<< ERROR!
> {code}
> h4. Additional Notes
> Casey was able to reproduce this locally (but not in his IDE); a couple of 
> details are in the dev list excerpt below.
> Fixing this should ideally include adding more detailed logging to help 
> avoid these issues in the future.  Unfortunately, Casey notes that in this 
> case adding logging seems to make the issue rarer.  Still, logging that makes 
> the flow easier to follow (with logging levels tuned as appropriate) would 
> help resolve issues like this in the future.
> h4. Dev List Excerpt
> Per Casey:
> {quote}
> Ok, so now I'm concerned that this isn't a fluke.  Here's an excerpt from
> the failing logs on travis for my PR with substantially longer timeouts (
> https://s3.amazonaws.com/archive.travis-ci.org/jobs/194575474/log.txt)
> {code}
> Running org.apache.metron.solr.integration.SolrIndexingIntegrationTest
> 0 vs 10 vs 0
> Processed 
> target/indexingIntegrationTest/hdfs/test/enrichment-null-0-0-1485200689038.json
> 10 vs 10 vs 6
> Processed 
> target/indexingIntegrationTest/hdfs/test/enrichment-null-0-0-1485200689038.json
> 10 vs 10 vs 6
> Processed 
> target/indexingIntegrationTest/hdfs/test/enrichment-null-0-0-1485200689038.json
> 10 vs 10 vs 6
> Processed 
> target/indexingIntegrationTest/hdfs/test/enrichment-null-0-0-1485200689038.json
> 10 vs 10 vs 6
> Processed 
> target/indexingIntegrationTest/hdfs/test/enrichment-null-0-0-1485200689038.json
> 10 vs 10 vs 6
> Processed 
> target/indexingIntegrationTest/hdfs/test/enrichment-null-0-0-1485200689038.json
> 10 vs 10 vs 6
> Processed 
> target/indexingIntegrationTest/hdfs/test/enrichment-null-0-0-1485200689038.json
> 10 vs 10 vs 6
> Processed 
> target/indexingIntegrationTest/hdfs/test/enrichment-null-0-0-1485200689038.json
> 10 vs 10 vs 6
> Processed 
> target/indexingIntegrationTest/hdfs/test/enrichment-null-0-0-1485200689038.json
> 10 vs 10 vs 6
> Processed 
> target/indexingIntegrationTest/hdfs/test/enrichment-null-0-0-1485200689038.json
> 10 vs 10 vs 6
> Processed 
> target/indexingIntegrationTest/hdfs/test/enrichment-null-0-0-1485200689038.json
> 10 vs 10 vs 6
> Processed 
> target/indexingIntegrationTest/hdfs/test/enrichment-null-0-0-1485200689038.json
> 10 vs 10 vs 6
> Processed 
> target/indexingIntegrationTest/hdfs/test/enrichment-null-0-0-1485200689038.json
> 10 vs 10 vs 6
> Processed 
> target/indexingIntegrationTest/hdfs/test/enrichment-null-0-0-1485200689038.json
> 10 vs 10 vs 6
> Processed 
> target/indexingIntegrationTest/hdfs/test/enrichment-null-0-0-1485200689038.json
> 10 vs 10 vs 6
> Processed 
> target/indexingIntegrationTest/hdfs/test/enrichment-null-0-0-1485200689038.json
> 10 vs 10 vs 6
> Processed 
> target/indexingIntegrationTest/hdfs/test/enrichment-null-0-0-1485200689038.json
> 10 vs 10 vs 6
> Processed 
> target/indexingIntegrationTest/hdfs/test/enrichment-null-0-0-1485200689038.json
> 10 vs 10 vs 6
> Processed 
> target/indexingIntegrationTest/hdfs/test/enrichment-null-0-0-1485200689038.json
> 10 vs 10 vs 6
> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed:
> 317.056 sec <<< FAILURE!
> test(org.apache.metron.solr.integration.SolrIndexingIntegrationTest)
> Time elapsed: 316.949 sec  <<< ERROR!
> java.lang.RuntimeException: Took too long to complete: 300783 > 300000
>         at 
> org.apache.metron.integration.ComponentRunner.process(ComponentRunner.java:131)
>         at 
> org.apache.metron.indexing.integration.IndexingIntegrationTest.test(IndexingIntegrationTest.java:173)
> {code}
> I'm getting the impression that this isn't the timeout and we have a
> mystery on our hands.  Each of those lines "10 vs 10 vs 6" happen 15
> seconds apart.  That line means that it read 10 entries from kafka, 10
> entries from the indexed data and 6 entries from HDFS.  It's that 6
> entries that is the problem.   Also of note, this does not seem to
> happen to me locally AND it's not consistent on Travis.  Given all
> that I'd say that it's a problem with the HDFS Writer not getting
> flushed, but I verified that it is indeed flushed per message.
> One more thing, just for posterity here, it always freezes at 6 records
> written to HDFS.  That's the reason I thought it was a flushing issue.
> One thing that I would caution though is that this is likely a heisenbug.
> The more logging I added earlier made it less likely to occur. It seems
> more likely to occur on Travis than locally and I made it happen by
> repeatedly running mvn install on Metron-solr (after a mvn install of the
> whole project).
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
