Alexander Belyak created IGNITE-11783:
-----------------------------------------
Summary: Open file limit for deb distribution
Key: IGNITE-11783
URL: https://issues.apache.org/jira/browse/IGNITE-11783
Project: Ignite
Issue Type: Bug
Components: persistence
Affects Versions: 2.7
Environment: ubuntu-16.04
Reporter: Alexander Belyak
Steps to reproduce:
1) Install Ignite from the deb package on Ubuntu 16.04
2) Start a node with persistence enabled
3) Create 5 caches (or one cache with 4000+ partitions)
Error text:
{noformat}
[18:29:44,369][INFO][exchange-worker-#43][GridCacheDatabaseSharedManager] Restoring partition state for local groups [cntPartStateWal=0, lastCheckpointId=bd24ff23-da6f-46e5-bafd-b643db3870d4]
[18:29:51,864][SEVERE][exchange-worker-#43][] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext [type=CRITICAL_ERROR, err=class o.a.i.i.processors.cache.persistence.StorageException: Failed to initialize partition file: /usr/share/apache-ignite/work/db/node00-f49af718-48da-4186-b664-62aca736bdc9/cache-SQL_PUBLIC_VERTEX_TBL/part-913.bin]]
class org.apache.ignite.internal.processors.cache.persistence.StorageException: Failed to initialize partition file: /usr/share/apache-ignite/work/db/node00-f49af718-48da-4186-b664-62aca736bdc9/cache-SQL_PUBLIC_VERTEX_TBL/part-913.bin
    at org.apache.ignite.internal.processors.cache.persistence.file.FilePageStore.init(FilePageStore.java:444)
    at org.apache.ignite.internal.processors.cache.persistence.file.FilePageStore.ensure(FilePageStore.java:650)
    at org.apache.ignite.internal.processors.cache.persistence.file.FilePageStoreManager.ensure(FilePageStoreManager.java:712)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.restorePartitionStates(GridCacheDatabaseSharedManager.java:2472)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.applyLastUpdates(GridCacheDatabaseSharedManager.java:2419)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.restoreState(GridCacheDatabaseSharedManager.java:1628)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.beforeExchange(GridCacheDatabaseSharedManager.java:1302)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.distributedExchange(GridDhtPartitionsExchangeFuture.java:1453)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:806)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:2667)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:2539)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.file.FileSystemException: /usr/share/apache-ignite/work/db/node00-f49af718-48da-4186-b664-62aca736bdc9/cache-SQL_PUBLIC_VERTEX_TBL/part-913.bin: Too many open files
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
    at sun.nio.fs.UnixFileSystemProvider.newAsynchronousFileChannel(UnixFileSystemProvider.java:196)
    at java.nio.channels.AsynchronousFileChannel.open(AsynchronousFileChannel.java:248)
    at java.nio.channels.AsynchronousFileChannel.open(AsynchronousFileChannel.java:301)
    at org.apache.ignite.internal.processors.cache.persistence.file.AsyncFileIO.<init>(AsyncFileIO.java:57)
    at org.apache.ignite.internal.processors.cache.persistence.file.AsyncFileIOFactory.create(AsyncFileIOFactory.java:53)
    at org.apache.ignite.internal.processors.cache.persistence.file.FilePageStore.init(FilePageStore.java:416)
    ... 12 more
{noformat}
This happens because the systemd service description
(/etc/systemd/system/[email protected]) does not contain:
{noformat}
LimitNOFILE=500000
(possible with) LimitNPROC=500000
{noformat}
see: https://fredrikaverpil.github.io/2016/04/27/systemd-and-resource-limits/
Possibly, the installation script should also add:
* "fs.file-max = 2097152" to "/etc/sysctl.conf"
* the following lines to /etc/security/limits.conf:
{noformat}
* hard nofile 500000
* soft nofile 500000
root hard nofile 500000
root soft nofile 500000
{noformat}
see: https://easyengine.io/tutorials/linux/increase-open-files-limit
It would also be great if the Ignite startup process checked the open file
limit and printed a link to the documentation page if:
1) persistence is enabled
2) the limit is below some value (<=4096)
3) the limit is below the total number of partitions on the current node
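The startup check proposed above could look roughly like the sketch below. It is only an illustration, not existing Ignite code: the class name OpenFileLimitCheck, the method names and the warning text are hypothetical, and it assumes the three conditions combine as "1 and (2 or 3)". The limit is read through the JDK's com.sun.management.UnixOperatingSystemMXBean, which is only available on Unix-like JVMs.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.util.OptionalLong;

/** Hypothetical helper, not part of Ignite: sketches the proposed startup check. */
final class OpenFileLimitCheck {
    /** Threshold taken from the report ("<=4096"). */
    static final long LOW_LIMIT = 4096;

    /** Returns the max file descriptor count for this JVM, or empty if the JVM does not expose it. */
    static OptionalLong maxFileDescriptors() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            return OptionalLong.of(
                ((com.sun.management.UnixOperatingSystemMXBean) os).getMaxFileDescriptorCount());
        }
        return OptionalLong.empty();
    }

    /**
     * Assumed reading of the three conditions: warn when persistence is enabled
     * and the limit is either low in absolute terms or below the local partition count.
     */
    static boolean shouldWarn(boolean persistenceEnabled, long limit, long localPartitions) {
        return persistenceEnabled && (limit <= LOW_LIMIT || limit < localPartitions);
    }

    public static void main(String[] args) {
        long localPartitions = 5 * 1024; // illustrative value: 5 caches with 1024 partitions each
        OptionalLong limit = maxFileDescriptors();
        if (limit.isPresent() && shouldWarn(true, limit.getAsLong(), localPartitions)) {
            System.err.println("Open file limit " + limit.getAsLong()
                + " may be too low for " + localPartitions
                + " local partitions; see the documentation on raising it.");
        }
    }
}
```

With the values from this report, a limit of 4096 or less together with 4000+ partitions would trigger the warning before the node reaches partition state restore.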
One more thing: if Ignite gets a "Too many open files" exception in the
middle of rebalancing, the situation is terrible: the whole cluster just stops
working. This can happen if each node is already close to its limit and:
* someone creates an additional cache
* the topology changes (a node is removed) and each remaining node gets more
local partitions.
Could we remember the limit on startup and check it each time we are about to
create a local partition?
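One possible shape for such a guard is sketched below. Everything in it is an assumption made for illustration: FileDescriptorBudget, canOpen and the reserve parameter are hypothetical names, not Ignite APIs, and the sketch assumes one file descriptor per partition file, as the stack trace above suggests. The limit is captured once at startup and compared with the current descriptor count before new partition files are opened.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

/** Hypothetical guard, not part of Ignite: remembers the limit once and checks before opening files. */
final class FileDescriptorBudget {
    /** Limit remembered at startup. */
    private final long limit;
    /** Descriptors kept free for sockets, WAL segments, logs and so on (assumed headroom). */
    private final long reserve;

    FileDescriptorBudget(long limit, long reserve) {
        this.limit = limit;
        this.reserve = reserve;
    }

    /** Captures the JVM's limit; falls back to "unlimited" when the JVM does not expose it. */
    static FileDescriptorBudget fromJvm(long reserve) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        long max = os instanceof com.sun.management.UnixOperatingSystemMXBean
            ? ((com.sun.management.UnixOperatingSystemMXBean) os).getMaxFileDescriptorCount()
            : Long.MAX_VALUE;
        return new FileDescriptorBudget(max, reserve);
    }

    /** Current number of open descriptors, or 0 when the JVM does not expose it. */
    static long currentlyOpen() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        return os instanceof com.sun.management.UnixOperatingSystemMXBean
            ? ((com.sun.management.UnixOperatingSystemMXBean) os).getOpenFileDescriptorCount()
            : 0L;
    }

    /** True if 'needed' more descriptors fit under the remembered limit while keeping the reserve free. */
    boolean canOpen(long currentlyOpen, long needed) {
        // Written as a subtraction so that an "unlimited" limit cannot overflow.
        return needed <= limit - reserve - currentlyOpen;
    }

    public static void main(String[] args) {
        FileDescriptorBudget budget = FileDescriptorBudget.fromJvm(64);
        long newPartitions = 1024; // illustrative: partitions a new cache would add locally
        if (!budget.canOpen(currentlyOpen(), newPartitions)) {
            System.err.println("Refusing to create " + newPartitions
                + " local partitions: open file limit would be exceeded.");
        }
    }
}
```

A check like this would let a node reject a cache start or a rebalance assignment up front instead of failing with a critical error halfway through.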