Tim Owen created SOLR-9389:
------------------------------

             Summary: HDFS Transaction logs stay open for writes which leaks Xceivers
                 Key: SOLR-9389
                 URL: https://issues.apache.org/jira/browse/SOLR-9389
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
          Components: Hadoop Integration, hdfs
    Affects Versions: 6.1, master (7.0)
            Reporter: Tim Owen


The HdfsTransactionLog implementation keeps a Hadoop FSDataOutputStream open 
for its whole lifetime, which consumes two threads on the HDFS data node server 
(dataXceiver and packetresponder) even once the Solr tlog has finished being 
written to.

This means for a cluster with many indexes on HDFS, the number of Xceivers can 
keep growing and eventually hit the limit of 4096 on the data nodes. It's 
especially likely for indexes that have low write rates, because Solr keeps 
enough tlogs around to contain 100 documents (up to a limit of 10 tlogs). 
There's also the issue that attempting to write to a finished tlog would be a 
major bug, so closing it for writes helps catch that.

Our cluster during testing had 100+ collections with 100 shards each, spread 
across 40 boxes (each running 4 Solr nodes and 1 HDFS data node), with 3x 
replication for the tlog files. As a result we hit the Xceiver limit fairly 
easily and had to use the attached patch to ensure tlogs were closed for writes 
once finished.
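
To make the scale concrete, here is a rough back-of-envelope sketch of the 
worst case for the cluster described above. It is only an estimate and rests on 
assumptions that are not stated in this report: one Solr core per shard, every 
retained tlog holding an open write stream, each open stream occupying one 
Xceiver slot on every data node in its replication pipeline, and load spread 
evenly across data nodes. The class and method names are made up for 
illustration.

```java
public final class XceiverEstimate {

    /**
     * Rough upper bound on Xceiver slots held per data node by open tlog
     * write streams. Assumes (hypothetically) one slot per stream per
     * pipeline replica and an even spread across data nodes.
     */
    public static long perDataNode(long collections,
                                   long shardsPerCollection,
                                   long openTlogsPerShard,
                                   long hdfsReplication,
                                   long dataNodes) {
        // One open output stream per retained tlog.
        long openStreams = collections * shardsPerCollection * openTlogsPerShard;
        // Each stream ties up a slot on every data node in its pipeline.
        long pipelineSlots = openStreams * hdfsReplication;
        return pipelineSlots / dataNodes;
    }

    public static void main(String[] args) {
        // 100 collections x 100 shards, up to 10 tlogs each, 3x replication, 40 data nodes.
        System.out.println(perDataNode(100, 100, 10, 3, 40)); // prints 7500
    }
}
```

Under these assumptions the worst case is well above the 4096 limit, and even 
6 retained tlogs per shard would give 4500 per data node.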

The patch introduces an extra lifecycle state for the tlog, so it can be closed 
for writes and free up the HDFS resources, while still being available for 
reading. I've tried to make it as unobtrusive as I could, but there's probably 
a better way. I have not changed the behaviour of the local disk tlog 
implementation, because it only consumes a file descriptor regardless of read 
or write.
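
As an illustration of the idea (not the attached patch itself), a minimal 
sketch of a log with an extra "closed for writes" state might look like the 
following. The class, enum and method names are hypothetical, and a plain 
java.io.OutputStream stands in for the Hadoop FSDataOutputStream so the example 
is self-contained.

```java
import java.io.IOException;
import java.io.OutputStream;

public class TlogLifecycleSketch {

    /** Hypothetical lifecycle states; WRITES_CLOSED is the extra one. */
    public enum State { OPEN, WRITES_CLOSED, CLOSED }

    // Stands in for the HDFS output stream that holds data node resources.
    private final OutputStream out;
    private State state = State.OPEN;

    public TlogLifecycleSketch(OutputStream out) {
        this.out = out;
    }

    /** Appends a record; fails loudly if the log is already finished. */
    public synchronized void write(byte[] record) throws IOException {
        if (state != State.OPEN) {
            throw new IllegalStateException("tlog is not open for writes: " + state);
        }
        out.write(record);
    }

    /** Releases the write-side resources but keeps the log readable. */
    public synchronized void closeForWrites() throws IOException {
        if (state == State.OPEN) {
            out.flush();
            out.close();
            state = State.WRITES_CLOSED;
        }
    }

    /** Readers may still use the log until it is fully closed. */
    public synchronized boolean isReadable() {
        return state != State.CLOSED;
    }

    /** Full close: also closes the write side if that has not happened yet. */
    public synchronized void close() throws IOException {
        closeForWrites();
        state = State.CLOSED;
    }
}
```

In this sketch, calling write() after closeForWrites() throws an 
IllegalStateException, which is the kind of early failure that would surface an 
accidental write to a finished tlog.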

N.B. We have decided not to use Solr-on-HDFS now; we're using local disk (for 
various reasons). So I don't have an HDFS cluster to do further testing on this; 
I'm just contributing the patch which worked for us.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
