We've been having strange crashes in Neo4j at a client site that have been 
worrying us, and I'd love to hear if they can be mitigated.

*System Spec*
Neo4j 2.1.7 on JDK 1.7.0_71, running on Windows Server 2012 R2 (32GB RAM, one 
quad-core Intel Xeon E3-1220 V2 CPU @ 3.10GHz)

*Course of Events*
Our system runs continuously, receiving data every few minutes, parsing 
the items and feeding them into Neo4j for later querying. Most nodes have a 
very large number of relationships and properties: for 4 million nodes, we 
had about 390 million relationships and 1.8 billion properties. After a few 
weeks of operation, graph.db had grown to about 150GB - ~75GB for the 
.strings file, ~70GB for the property store and ~10GB for the relationship 
store file. 
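
For context, the write path looks roughly like the sketch below (embedded 
2.x API; the store path, labels, relationship type and property names are 
illustrative, not our real schema):

    import org.neo4j.graphdb.DynamicLabel;
    import org.neo4j.graphdb.DynamicRelationshipType;
    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.Node;
    import org.neo4j.graphdb.Transaction;
    import org.neo4j.graphdb.factory.GraphDatabaseFactory;

    public class IngestSketch {
        public static void main(String[] args) {
            // Embedded Neo4j 2.x: open (or create) the store directory.
            GraphDatabaseService db = new GraphDatabaseFactory()
                    .newEmbeddedDatabase("C:/neo4j/graph.db");
            try (Transaction tx = db.beginTx()) {
                // One node per parsed item, carrying a few string/long properties...
                Node item = db.createNode(DynamicLabel.label("Item"));
                item.setProperty("receivedAt", System.currentTimeMillis());
                item.setProperty("payload", "raw text of the incoming item");

                // ...and relationships to other nodes created or looked up earlier.
                Node source = db.createNode(DynamicLabel.label("Source"));
                item.createRelationshipTo(source,
                        DynamicRelationshipType.withName("RECEIVED_FROM"));
                tx.success();
            } finally {
                db.shutdown();
            }
        }
    }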

Slowly we noticed that performance degraded for most of our queries. Then, 
after a few days, we started getting errors when writing to graph.db: 
*TransactionFailureExceptions*, caused by 
*UnderlyingStorageExceptions*, which in turn are caused by 
*java.io.IOException: The requested operation could not be completed due 
to a file system limitation*. (Sorry for not bringing the full stack trace, 
but the log files are at the client site and I can't take them off-site.) This 
happened sporadically at first (followed by logfile recovery), and then 
started happening on every startup, preventing the service from starting at all.
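
To show the structure of the failure without the actual trace, the sketch 
below (illustrative code, not taken from the client machine) catches a 
failed commit and walks the cause chain down to the IOException:

    import java.io.IOException;

    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.Transaction;

    public class LogWriteFailure {
        // Runs one write transaction and, if it fails, walks the cause chain so
        // the root java.io.IOException ends up in our own application log.
        static void writeWithDiagnostics(GraphDatabaseService db, Runnable work) {
            try (Transaction tx = db.beginTx()) {
                work.run();
                tx.success();
            } catch (RuntimeException e) {
                for (Throwable t = e; t != null; t = t.getCause()) {
                    System.err.println(t.getClass().getName() + ": " + t.getMessage());
                    if (t instanceof IOException) {
                        // This is where we see: "The requested operation could not
                        // be completed due to a file system limitation"
                        break;
                    }
                }
                throw e;
            }
        }
    }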

*Possible Causes*

I checked the usual suspects - there was plenty of room on the data disk (~230GB 
free) and on the system drive (~70GB free), none of the files were anywhere near 
the filesystem's size limits, and there were no quotas, permission 
restrictions or anything else of that sort defined. The same drive also carries other 
heavy loads - a SQL Server database and a drop folder where the 
incoming data arrives (~50-100MB/minute).
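
For what it's worth, the free-space figures can also be confirmed from the 
JVM's side; the sketch below (illustrative path, not something we actually 
ran on the client machine) prints the file store type and space as seen by 
Java 7's NIO:

    import java.io.IOException;
    import java.nio.file.FileStore;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class DiskCheck {
        public static void main(String[] args) throws IOException {
            // Location of the store directory - illustrative path, not the real one.
            Path storeDir = Paths.get("C:/neo4j/graph.db");
            FileStore store = Files.getFileStore(storeDir);
            long gb = 1024L * 1024 * 1024;
            System.out.println("File store type : " + store.type()); // e.g. NTFS
            System.out.println("Total space, GB : " + store.getTotalSpace() / gb);
            System.out.println("Usable space, GB: " + store.getUsableSpace() / gb);
        }
    }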

One forum post (which I can't find now, unfortunately) suggested that disk 
fragmentation could be an issue. I checked and saw that Windows' scheduled 
defrag task had never run, probably because the server is never idle long 
enough for it to start. When I ran it manually, it reported the disk as 100% 
fragmented, but optimizing the 1TB disk would have taken over 10 
hours, which was more downtime than we were willing to accept, so we simply 
deleted graph.db and started afresh with an empty store (which allowed Neo4j to start again). 
However, I'm worried that the problem will return in a few days or weeks, 
since we never addressed the root cause.

*My questions*

1. Is disk fragmentation a plausible cause for these errors, and for 
preventing Neo4j from starting?
2. Would regular defrag runs prevent it from recurring? And other than 
the performance slowdown, would a defrag operation interfere with Neo4j 
while it is running?


Thanks in advance,
  Avner
