The NN and DNs have 600-800 files open each, and my ulimit is 1024 per
process. On the machine as a whole, `lsof | wc -l` reports 1047067.
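(A minimal sketch of how such per-process counts can be gathered, assuming
a Linux /proc filesystem, JVM daemons visible to jps, and lsof installed;
the <pid> below is a placeholder:)

    # Open-fd count per JVM daemon; run as the same user that owns them,
    # since /proc/<pid>/fd is only readable by the owner (or root).
    jps | while read pid name; do
      echo "$pid $name: $(ls /proc/$pid/fd 2>/dev/null | wc -l) fds"
    done

    # What one daemon's descriptors point to, grouped and counted; the
    # file name is the last column of lsof output.
    lsof -p <pid> | awk '{print $NF}' | sort | uniq -c | sort -rn | head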
proc_nodemanager and proc_regionserver have a ton of open files: tens of
thousands each. For instance, nodemanager has 1200 fds pointing to one of
three different zookeeper jars.

On Sun, Jul 24, 2016 at 9:49 PM, Martin Grund (Das Grundprinzip.de)
<[email protected]> wrote:
> One idea is to check your ulimit for file descriptors and run `lsof | grep
> wc -l` to see if you for some reason exceeded the limit. Otherwise, a fresh
> reboot might help to figure out if you somewhere have a spare process
> hogging FDs.
>
> On Sun, Jul 24, 2016 at 8:09 PM Jim Apple <[email protected]> wrote:
>
>> The NN and DN logs are empty.
>>
>> I bin/kill-all.sh at the beginning of this, so I assume that nothing
>> is taking them except for my little Impala work.
>>
>> On Sun, Jul 24, 2016 at 8:03 PM, Bharath Vissapragada
>> <[email protected]> wrote:
>> > Based on
>> >
>> > 16/07/24 18:36:08 WARN hdfs.BlockReaderFactory: I/O error constructing
>> > remote block reader.
>> > java.net.SocketException: Too many open files
>> >
>> > 16/07/24 18:36:08 WARN hdfs.DFSClient: Failed to connect to
>> > /127.0.0.1:31000 for block, add to deadNodes and continue.
>> > java.net.SocketException: Too many open files
>> >
>> > I'm guessing your hdfs instance might be overloaded (check the NN/DN
>> > logs). HMaster is unable to connect to NN while opening regions and
>> > hence throwing the error.
>> >
>> > On Mon, Jul 25, 2016 at 8:05 AM, Jim Apple <[email protected]> wrote:
>> >
>> >> Several thousand lines of things like
>> >>
>> >> WARN shortcircuit.ShortCircuitCache: ShortCircuitCache(0x419c7df4):
>> >> failed to load 1073764575_BP-1490185442-127.0.0.1-1456935654337
>> >>
>> >> java.lang.NullPointerException at
>> >> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitReplica.<init>(ShortCircuitReplica.java:126)
>> >> ...
>> >>
>> >> 16/07/24 18:36:08 WARN hdfs.BlockReaderFactory:
>> >> BlockReaderFactory(fileName=/hbase/MasterProcWALs/state-00000000000000003172.log,
>> >> block=BP-1490185442-127.0.0.1-1456935654337:blk_1073764629_23805):
>> >> error creating ShortCircuitReplica.
>> >>
>> >> java.io.EOFException: unexpected EOF while reading metadata file header
>> >>
>> >> 16/07/24 18:36:08 WARN hdfs.BlockReaderFactory: I/O error constructing
>> >> remote block reader.
>> >> java.net.SocketException: Too many open files
>> >>
>> >> 16/07/24 18:36:08 WARN hdfs.DFSClient: Failed to connect to
>> >> /127.0.0.1:31000 for block, add to deadNodes and continue.
>> >> java.net.SocketException: Too many open files
>> >>
>> >> 16/07/24 18:36:08 INFO hdfs.DFSClient: Could not obtain
>> >> BP-1490185442-127.0.0.1-1456935654337:blk_1073764629_23805 from any
>> >> node: java.io.IOException: No live nodes contain block
>> >> BP-1490185442-127.0.0.1-1456935654337:blk_1073764629_23805 after
>> >> checking nodes =
>> >> [DatanodeInfoWithStorage[127.0.0.1:31000,DS-0232508a-5512-4827-bcaf-c922f1e65eb1,DISK]],
>> >> ignoredNodes = null No live nodes contain current block Block
>> >> locations:
>> >> DatanodeInfoWithStorage[127.0.0.1:31000,DS-0232508a-5512-4827-bcaf-c922f1e65eb1,DISK]
>> >> Dead nodes:
>> >> DatanodeInfoWithStorage[127.0.0.1:31000,DS-0232508a-5512-4827-bcaf-c922f1e65eb1,DISK].
>> >> Will get new block locations from namenode and retry...
>> >> 16/07/24 18:36:08 WARN hdfs.DFSClient: DFS chooseDataNode: got # 1
>> >> IOException, will wait for 2772.7114628272548 msec.
>> >> 16/07/24 18:36:11 WARN hdfs.BlockReaderFactory:
>> >> BlockReaderFactory(fileName=/hbase/MasterProcWALs/state-00000000000000003172.log,
>> >> block=BP-1490185442-127.0.0.1-1456935654337:blk_1073764629_23805):
>> >> error creating ShortCircuitReplica.
>> >> java.io.IOException: Illegal seek
>> >>         at sun.nio.ch.FileDispatcherImpl.pread0(Native Method)
>> >>
>> >> On Sun, Jul 24, 2016 at 7:24 PM, Bharath Vissapragada
>> >> <[email protected]> wrote:
>> >> > Do you see something in the HMaster log? From the error it looks
>> >> > like the Hbase master hasn't started properly for some reason.
>> >> >
>> >> > On Mon, Jul 25, 2016 at 6:08 AM, Jim Apple <[email protected]>
>> >> > wrote:
>> >> >
>> >> >> I tried reloading the data with
>> >> >>
>> >> >> ./bin/load-data.py --workloads functional-query
>> >> >>
>> >> >> but that gave errors like
>> >> >>
>> >> >> Executing HBase Command: hbase shell
>> >> >> load-functional-query-core-hbase-generated.create
>> >> >> 16/07/24 17:19:39 INFO Configuration.deprecation: hadoop.native.lib
>> >> >> is deprecated. Instead, use io.native.lib.available
>> >> >> SLF4J: Class path contains multiple SLF4J bindings.
>> >> >> SLF4J: Found binding in
>> >> >> [jar:file:/opt/Impala-Toolchain/cdh_components/hbase-1.2.0-cdh5.9.0-SNAPSHOT/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> >> >> SLF4J: Found binding in
>> >> >> [jar:file:/opt/Impala-Toolchain/cdh_components/hadoop-2.6.0-cdh5.9.0-SNAPSHOT/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> >> >> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
>> >> >> explanation.
>> >> >> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>> >> >>
>> >> >> ERROR: Can't get the locations
>> >> >>
>> >> >> Here is some help for this command:
>> >> >> Start disable of named table:
>> >> >>   hbase> disable 't1'
>> >> >>   hbase> disable 'ns1:t1'
>> >> >>
>> >> >> ERROR: Can't get master address from ZooKeeper; znode data == null
>> >> >>
>> >> >> On Sun, Jul 24, 2016 at 5:12 PM, Jim Apple <[email protected]>
>> >> >> wrote:
>> >> >> > I'm having trouble with my HBase environment, and it's preventing
>> >> >> > me from running bin/run-all-tests.sh. I am on Ubuntu 14.04. I have
>> >> >> > tried this with a clean build, and I have tried unset
>> >> >> > LD_LIBRARY_PATH && bin/impala-config.sh, and I have tried
>> >> >> > ./testdata/bin/run-all.sh
>> >> >> >
>> >> >> > Here is the error I get from compute stats:
>> >> >> > (./testdata/bin/compute-table-stats.sh)
>> >> >> >
>> >> >> > Executing: compute stats functional_hbase.alltypessmall
>> >> >> >   -> Error: ImpalaBeeswaxException:
>> >> >> > Query aborted:RuntimeException: couldn't retrieve HBase table
>> >> >> > (functional_hbase.alltypessmall) info:
>> >> >> > Unable to find region for in functional_hbase.alltypessmall after
>> >> >> > 35 tries.
>> >> >> > CAUSED BY: NoServerForRegionException: Unable to find region for
>> >> >> > in functional_hbase.alltypessmall after 35 tries.
>> >> >> >
>> >> >> > Here is a snippet of the error in ./testdata/bin/split-hbase.sh
>> >> >> >
>> >> >> > Sun Jul 24 15:24:52 PDT 2016,
>> >> >> > RpcRetryingCaller{globalStartTime=1469399003900, pause=100,
>> >> >> > retries=31}, org.apache.hadoop.hbase.MasterNotRunningException:
>> >> >> > com.google.protobuf.ServiceException:
>> >> >> > org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.ipc.ServerNotRunningYetException):
>> >> >> > org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server
>> >> >> > is not running yet
>> >> >> >
>> >> >> > I tried ./bin/create_testdata.sh, but that exited almost
>> >> >> > immediately with no error.
>> >> >> >
>> >> >> > Has anyone else seen and solved this before?
>> >> >>
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Thanks,
>> >> > Bharath
>> >>
>> >
>> >
>> >
>> > --
>> > Thanks,
>> > Bharath
>>
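(The errors in this thread all trace back to "Too many open files", so the
generic Linux remedy is to raise the per-process descriptor limit. A
minimal sketch for Ubuntu 14.04, assuming the limit itself rather than an
fd leak in one of the daemons is the problem; the 65536 value and
<username> are placeholders, and the zkcli check assumes the hbase
launcher script is on the PATH:)

    # Current soft and hard limits on open files for this shell.
    ulimit -Sn
    ulimit -Hn

    # Raise the soft limit for this shell, up to the hard limit.
    ulimit -n 65536

    # To persist across logins, add lines like these to
    # /etc/security/limits.conf (pam_limits is enabled by default on
    # Ubuntu):
    #   <username>  soft  nofile  65536
    #   <username>  hard  nofile  65536

    # To confirm whether the HMaster ever registered ("znode data == null"
    # above means its znode was empty), HBase ships a ZooKeeper CLI:
    hbase zkcli get /hbase/master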
