[
https://issues.apache.org/jira/browse/NUTCH-2756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16989519#comment-16989519
]
Lucas Pauchard commented on NUTCH-2756:
---------------------------------------
Hi, thanks for your fast response.
Here are the details you asked for:
{panel:title=Hadoop version}
Hadoop 3.1.1
Source code repository [https://github.com/apache/hadoop] -r
2b9a8c1d3a2caf1e733d57f346af3ff0d5ba529c
Compiled by leftnoteasy on 2018-08-02T04:26Z
Compiled with protoc 2.5.0
From source with checksum f76ac55e5b5ff0382a9f7df36a3ca5a0
This command was run using
/usr/local/hadoop/share/hadoop/common/hadoop-common-3.1.1.jar
{panel}
For the HDFS configuration, I'm not sure which file you really need, so I've
attached "hdfs-site.xml" along with the configuration files we changed because
of memory issues we had. We also made changes to the log4j file, but I don't
think that file matters here.
[^hdfs-site.xml] [^hadoop-env.sh] [^mapred-site.xml]
[^yarn-env.sh] [^yarn-site.xml]
Unfortunately, we don't keep job logs for more than 2 days, so I can't give
them to you. But today we ran into the same problem again, and here are the logs:
{panel:title=Parser job logs}
2019-12-06 06:34:45,903 INFO parse.ParseSegment: ParseSegment: starting at
2019-12-06 06:34:45
2019-12-06 06:34:45,917 INFO parse.ParseSegment: ParseSegment: segment:
crawloneokhttp/segment/20191206055043
2019-12-06 06:34:45,994 INFO client.RMProxy: Connecting to ResourceManager at
jobmaster/79.137.20.6:8032
2019-12-06 06:34:46,223 INFO mapreduce.JobResourceUploader: Disabling Erasure
Coding for path: /tmp/hadoop-yarn/staging/hadoop/.staging/job_1575565527267_0231
2019-12-06 06:35:00,461 INFO input.FileInputFormat: Total input files to
process : 6
2019-12-06 06:35:00,583 INFO mapreduce.JobSubmitter: number of splits:6
2019-12-06 06:35:00,686 INFO Configuration.deprecation:
yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead,
use yarn.system-metrics-publisher.enabled
2019-12-06 06:35:00,796 INFO mapreduce.JobSubmitter: Submitting tokens for
job: job_1575565527267_0231
2019-12-06 06:35:00,797 INFO mapreduce.JobSubmitter: Executing with tokens: []
2019-12-06 06:35:00,927 INFO conf.Configuration: resource-types.xml not found
2019-12-06 06:35:00,928 INFO resource.ResourceUtils: Unable to find
'resource-types.xml'.
2019-12-06 06:35:00,976 INFO impl.YarnClientImpl: Submitted application
application_1575565527267_0231
2019-12-06 06:35:01,006 INFO mapreduce.Job: The url to track the job:
http://x.x.x.x:y/proxy/application_1575565527267_0231/
2019-12-06 06:35:01,007 INFO mapreduce.Job: Running job: job_1575565527267_0231
2019-12-06 06:36:04,205 INFO mapreduce.Job: Job job_1575565527267_0231 running
in uber mode : false
2019-12-06 06:36:04,207 INFO mapreduce.Job: map 0% reduce 0%
2019-12-06 06:36:33,548 INFO mapreduce.Job: map 19% reduce 0%
2019-12-06 06:36:35,670 INFO mapreduce.Job: map 33% reduce 0%
2019-12-06 06:36:36,675 INFO mapreduce.Job: map 41% reduce 0%
2019-12-06 06:36:39,688 INFO mapreduce.Job: map 60% reduce 0%
2019-12-06 06:36:40,692 INFO mapreduce.Job: map 78% reduce 0%
2019-12-06 06:36:41,697 INFO mapreduce.Job: map 85% reduce 0%
2019-12-06 06:36:42,702 INFO mapreduce.Job: map 93% reduce 0%
2019-12-06 06:36:43,706 INFO mapreduce.Job: map 100% reduce 0%
2019-12-06 06:36:49,727 INFO mapreduce.Job: map 100% reduce 33%
2019-12-06 06:37:00,763 INFO mapreduce.Job: map 100% reduce 83%
2019-12-06 06:37:01,767 INFO mapreduce.Job: map 100% reduce 100%
2019-12-06 06:37:01,772 INFO mapreduce.Job: Job job_1575565527267_0231
completed successfully
2019-12-06 06:37:01,850 INFO mapreduce.Job: Counters: 57
File System Counters
FILE: Number of bytes read=108082746
FILE: Number of bytes written=219258714
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=115122772
HDFS: Number of bytes written=44282864
HDFS: Number of read operations=30
HDFS: Number of large read operations=0
HDFS: Number of write operations=42
Job Counters
Killed map tasks=1
Killed reduce tasks=1
Launched map tasks=6
Launched reduce tasks=7
Data-local map tasks=4
Rack-local map tasks=2
Total time spent by all maps in occupied slots (ms)=203550
Total time spent by all reduces in occupied slots (ms)=117558
Total time spent by all map tasks (ms)=203550
Total time spent by all reduce tasks (ms)=117558
Total vcore-milliseconds taken by all map tasks=203550
Total vcore-milliseconds taken by all reduce tasks=117558
Total megabyte-milliseconds taken by all map tasks=1250611200
Total megabyte-milliseconds taken by all reduce tasks=722276352
Map-Reduce Framework
Map input records=13798
Map output records=13798
Map output bytes=108027516
Map output materialized bytes=108082926
Input split bytes=972
Combine input records=0
Combine output records=0
Reduce input groups=13798
Reduce shuffle bytes=108082926
Reduce input records=13798
Reduce output records=13798
Spilled Records=27596
Shuffled Maps =36
Failed Shuffles=0
Merged Map outputs=36
GC time elapsed (ms)=2638
CPU time spent (ms)=184400
Physical memory (bytes) snapshot=14585151488
Virtual memory (bytes) snapshot=63409967104
Total committed heap usage (bytes)=22992650240
Peak Map Physical memory (bytes)=1402261504
Peak Map Virtual memory (bytes)=5282394112
Peak Reduce Physical memory (bytes)=1077035008
Peak Reduce Virtual memory (bytes)=5297762304
ParserStatus
success=13798
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=115121800
File Output Format Counters
Bytes Written=0
2019-12-06 06:37:01,853 INFO parse.ParseSegment: ParseSegment: finished at
2019-12-06 06:37:01, elapsed: 00:02:15
{panel}
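By the way, even though the job above reports "completed successfully", a part
file can apparently still end up unreadable afterwards. To check a suspect file
right after a run, something like this small standalone check should work (just
a sketch against Hadoop's public SequenceFile.Reader API, not code from our
crawler; the class name is made up). A zero-byte or truncated file fails with
the same EOFException shown in the issue description below:
{code:java}
import java.io.EOFException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// Sketch: open a segment part file with SequenceFile.Reader and count records.
// A zero-byte or truncated file fails header validation with EOFException.
public class CheckSegmentPart {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path part = new Path(args[0]); // full HDFS path to a part-r-* file
    try (SequenceFile.Reader reader =
             new SequenceFile.Reader(conf, SequenceFile.Reader.file(part))) {
      Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable val = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
      long records = 0;
      while (reader.next(key, val)) {
        records++;
      }
      System.out.println(part + ": OK, " + records + " records");
    } catch (EOFException e) {
      System.out.println(part + ": NOT a readable SequenceFile (zero-byte or truncated?)");
    }
  }
}
{code}
It would be run with the full path, e.g.
hdfs://jobmaster:9000/user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00004.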
Here are more logs, this time from the namenode:
{panel:title=Creation and closing of the segment files}
2019-12-04 22:22:54,768 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
allocate blk_1074202201_461385, replicas=x.x.x.x:y, x.x.x.x:y for
/user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00000
2019-12-04 22:22:54,909 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
allocate blk_1074202202_461386, replicas=x.x.x.x:y, x.x.x.x:y for
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00000/data
2019-12-04 22:22:54,934 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
allocate blk_1074202203_461387, replicas=x.x.x.x:y, x.x.x.x:y for
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00000/data
2019-12-04 22:23:00,489 INFO org.apache.hadoop.hdfs.StateChange: DIR*
completeFile:
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00000/data
is closed by DFSClient_attempt_1575479127636_0046_r_000000_0_1430165290_1
2019-12-04 22:23:00,493 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
allocate blk_1074202204_461388, replicas=x.x.x.x:y, x.x.x.x:y for
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00000/index
2019-12-04 22:23:00,506 INFO org.apache.hadoop.hdfs.StateChange: DIR*
completeFile:
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00000/index
is closed by DFSClient_attempt_1575479127636_0046_r_000000_0_1430165290_1
2019-12-04 22:23:00,515 INFO org.apache.hadoop.hdfs.StateChange: DIR*
completeFile:
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00000/data
is closed by DFSClient_attempt_1575479127636_0046_r_000000_0_1430165290_1
2019-12-04 22:23:00,517 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
allocate blk_1074202205_461389, replicas=x.x.x.x:y, x.x.x.x:y for
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00000/index
2019-12-04 22:23:01,558 INFO org.apache.hadoop.hdfs.StateChange: DIR*
completeFile:
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00000/index
is closed by DFSClient_attempt_1575479127636_0046_r_000000_0_1430165290_1
2019-12-04 22:23:01,563 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK*
blk_1074202201_461385 is COMMITTED but not COMPLETE(numNodes= 0 < minimum = 1)
in file
/user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00000
2019-12-04 22:23:01,964 INFO org.apache.hadoop.hdfs.StateChange: DIR*
completeFile:
/user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00000
is closed by DFSClient_attempt_1575479127636_0046_r_000000_0_1430165290_1
2019-12-04 22:23:06,437 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
allocate blk_1074202206_461390, replicas=x.x.x.x:y, x.x.x.x:y for
/user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00004
2019-12-04 22:23:06,703 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
allocate blk_1074202207_461391, replicas=x.x.x.x:y, x.x.x.x:y for
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00004/data
2019-12-04 22:23:06,722 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
allocate blk_1074202208_461392, replicas=x.x.x.x:y, x.x.x.x:y for
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00004/data
2019-12-04 22:23:10,444 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
allocate blk_1074202209_461393, replicas=x.x.x.x:y, x.x.x.x:y for
/user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00001
2019-12-04 22:23:10,650 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
allocate blk_1074202210_461394, replicas=x.x.x.x:y, x.x.x.x:y for
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00001/data
2019-12-04 22:23:10,684 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
allocate blk_1074202211_461395, replicas=x.x.x.x:y, x.x.x.x:y for
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00001/data
2019-12-04 22:23:10,698 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
allocate blk_1074202212_461396, replicas=x.x.x.x:y, x.x.x.x:y for
/user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00002
2019-12-04 22:23:10,896 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
allocate blk_1074202213_461397, replicas=x.x.x.x:y, x.x.x.x:y for
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00002/data
2019-12-04 22:23:10,933 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
allocate blk_1074202214_461398, replicas=x.x.x.x:y, x.x.x.x:y for
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00002/data
2019-12-04 22:23:11,790 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
allocate blk_1074202215_461399, replicas=x.x.x.x:y, x.x.x.x:y for
/user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00003
2019-12-04 22:23:11,814 INFO org.apache.hadoop.hdfs.StateChange: DIR*
completeFile:
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00004/data
is closed by DFSClient_attempt_1575479127636_0046_r_000004_0_836874622_1
2019-12-04 22:23:11,819 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
allocate blk_1074202216_461400, replicas=x.x.x.x:y, x.x.x.x:y for
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00004/index
2019-12-04 22:23:11,838 INFO org.apache.hadoop.hdfs.StateChange: DIR*
completeFile:
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00004/index
is closed by DFSClient_attempt_1575479127636_0046_r_000004_0_836874622_1
2019-12-04 22:23:11,843 INFO org.apache.hadoop.hdfs.StateChange: DIR*
completeFile:
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00004/data
is closed by DFSClient_attempt_1575479127636_0046_r_000004_0_836874622_1
2019-12-04 22:23:11,846 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
allocate blk_1074202217_461401, replicas=x.x.x.x:y, x.x.x.x:y for
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00004/index
2019-12-04 22:23:11,888 INFO org.apache.hadoop.hdfs.StateChange: DIR*
completeFile:
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00004/index
is closed by DFSClient_attempt_1575479127636_0046_r_000004_0_836874622_1
2019-12-04 22:23:11,896 INFO org.apache.hadoop.hdfs.StateChange: DIR*
completeFile:
/user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00004
is closed by DFSClient_attempt_1575479127636_0046_r_000004_0_836874622_1
2019-12-04 22:23:12,027 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
allocate blk_1074202218_461402, replicas=x.x.x.x:y, x.x.x.x:y for
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00003/data
2019-12-04 22:23:12,060 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
allocate blk_1074202219_461403, replicas=x.x.x.x:y, x.x.x.x:y for
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00004/index
2019-12-04 22:23:12,064 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
allocate blk_1074202220_461404, replicas=x.x.x.x:y, x.x.x.x:y for
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00003/data
2019-12-04 22:23:12,221 INFO org.apache.hadoop.hdfs.StateChange: DIR*
completeFile:
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00004/index
is closed by DFSClient_attempt_1575479127636_0046_r_000004_1_1307945884_1
2019-12-04 22:23:12,229 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
allocate blk_1074202221_461405, replicas=x.x.x.x:y, x.x.x.x:y for
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00004/data
2019-12-04 22:23:12,258 INFO org.apache.hadoop.hdfs.StateChange: DIR*
completeFile:
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00004/data
is closed by DFSClient_attempt_1575479127636_0046_r_000004_1_1307945884_1
2019-12-04 22:23:12,284 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
allocate blk_1074202222_461406, replicas=x.x.x.x:y, x.x.x.x:y for
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00004/data
2019-12-04 22:23:12,298 INFO org.apache.hadoop.hdfs.StateChange: DIR*
completeFile:
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00004/data
is closed by DFSClient_attempt_1575479127636_0046_r_000004_1_1307945884_1
2019-12-04 22:23:13,521 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
allocate blk_1074202223_461407, replicas=x.x.x.x:y, x.x.x.x:y for
/user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00005
2019-12-04 22:23:13,681 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
allocate blk_1074202224_461408, replicas=x.x.x.x:y, x.x.x.x:y for
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00005/data
2019-12-04 22:23:13,717 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
allocate blk_1074202225_461409, replicas=x.x.x.x:y, x.x.x.x:y for
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00005/data
2019-12-04 22:23:15,160 INFO org.apache.hadoop.hdfs.StateChange: DIR*
completeFile:
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00001/data
is closed by DFSClient_attempt_1575479127636_0046_r_000001_0_-1169959789_1
2019-12-04 22:23:15,164 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
allocate blk_1074202226_461410, replicas=x.x.x.x:y, x.x.x.x:y for
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00001/index
2019-12-04 22:23:15,199 INFO org.apache.hadoop.hdfs.StateChange: DIR*
completeFile:
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00001/index
is closed by DFSClient_attempt_1575479127636_0046_r_000001_0_-1169959789_1
2019-12-04 22:23:15,204 INFO org.apache.hadoop.hdfs.StateChange: DIR*
completeFile:
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00001/data
is closed by DFSClient_attempt_1575479127636_0046_r_000001_0_-1169959789_1
2019-12-04 22:23:15,207 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
allocate blk_1074202227_461411, replicas=x.x.x.x:y, x.x.x.x:y for
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00001/index
2019-12-04 22:23:15,216 INFO org.apache.hadoop.hdfs.StateChange: DIR*
completeFile:
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00001/index
is closed by DFSClient_attempt_1575479127636_0046_r_000001_0_-1169959789_1
2019-12-04 22:23:15,222 INFO org.apache.hadoop.hdfs.StateChange: DIR*
completeFile:
/user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00001
is closed by DFSClient_attempt_1575479127636_0046_r_000001_0_-1169959789_1
2019-12-04 22:23:15,312 INFO org.apache.hadoop.hdfs.StateChange: DIR*
completeFile:
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00002/data
is closed by DFSClient_attempt_1575479127636_0046_r_000002_0_74990015_1
2019-12-04 22:23:15,317 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
allocate blk_1074202228_461412, replicas=x.x.x.x:y, x.x.x.x:y for
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00002/index
2019-12-04 22:23:15,337 INFO org.apache.hadoop.hdfs.StateChange: DIR*
completeFile:
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00002/index
is closed by DFSClient_attempt_1575479127636_0046_r_000002_0_74990015_1
2019-12-04 22:23:15,344 INFO org.apache.hadoop.hdfs.StateChange: DIR*
completeFile:
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00002/data
is closed by DFSClient_attempt_1575479127636_0046_r_000002_0_74990015_1
2019-12-04 22:23:15,347 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
allocate blk_1074202229_461413, replicas=x.x.x.x:y, x.x.x.x:y for
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00002/index
2019-12-04 22:23:15,367 INFO org.apache.hadoop.hdfs.StateChange: DIR*
completeFile:
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00002/index
is closed by DFSClient_attempt_1575479127636_0046_r_000002_0_74990015_1
2019-12-04 22:23:15,372 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK*
blk_1074202212_461396 is COMMITTED but not COMPLETE(numNodes= 0 < minimum = 1)
in file
/user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00002
2019-12-04 22:23:15,774 INFO org.apache.hadoop.hdfs.StateChange: DIR*
completeFile:
/user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00002
is closed by DFSClient_attempt_1575479127636_0046_r_000002_0_74990015_1
2019-12-04 22:23:16,276 INFO org.apache.hadoop.hdfs.StateChange: DIR*
completeFile:
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00003/data
is closed by DFSClient_attempt_1575479127636_0046_r_000003_0_489729318_1
2019-12-04 22:23:16,289 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
allocate blk_1074202230_461414, replicas=x.x.x.x:y, x.x.x.x:y for
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00003/index
2019-12-04 22:23:16,298 INFO org.apache.hadoop.hdfs.StateChange: DIR*
completeFile:
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00003/index
is closed by DFSClient_attempt_1575479127636_0046_r_000003_0_489729318_1
2019-12-04 22:23:16,303 INFO org.apache.hadoop.hdfs.StateChange: DIR*
completeFile:
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00003/data
is closed by DFSClient_attempt_1575479127636_0046_r_000003_0_489729318_1
2019-12-04 22:23:16,306 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
allocate blk_1074202231_461415, replicas=x.x.x.x:y, x.x.x.x:y for
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00003/index
2019-12-04 22:23:16,316 INFO org.apache.hadoop.hdfs.StateChange: DIR*
completeFile:
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00003/index
is closed by DFSClient_attempt_1575479127636_0046_r_000003_0_489729318_1
2019-12-04 22:23:16,321 INFO org.apache.hadoop.hdfs.StateChange: DIR*
completeFile:
/user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00003
is closed by DFSClient_attempt_1575479127636_0046_r_000003_0_489729318_1
2019-12-04 22:23:18,100 INFO org.apache.hadoop.hdfs.StateChange: DIR*
completeFile:
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00005/data
is closed by DFSClient_attempt_1575479127636_0046_r_000005_0_1606887826_1
2019-12-04 22:23:18,104 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
allocate blk_1074202232_461416, replicas=x.x.x.x:y, x.x.x.x:y for
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00005/index
2019-12-04 22:23:18,113 INFO org.apache.hadoop.hdfs.StateChange: DIR*
completeFile:
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00005/index
is closed by DFSClient_attempt_1575479127636_0046_r_000005_0_1606887826_1
2019-12-04 22:23:18,118 INFO org.apache.hadoop.hdfs.StateChange: DIR*
completeFile:
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00005/data
is closed by DFSClient_attempt_1575479127636_0046_r_000005_0_1606887826_1
2019-12-04 22:23:18,120 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
allocate blk_1074202233_461417, replicas=x.x.x.x:y, x.x.x.x:y for
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00005/index
2019-12-04 22:23:18,130 INFO org.apache.hadoop.hdfs.StateChange: DIR*
completeFile:
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00005/index
is closed by DFSClient_attempt_1575479127636_0046_r_000005_0_1606887826_1
2019-12-04 22:23:18,135 INFO org.apache.hadoop.hdfs.StateChange: DIR*
completeFile:
/user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00005
is closed by DFSClient_attempt_1575479127636_0046_r_000005_0_1606887826_1
{panel}
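One more thing I notice in these namenode logs: two different reduce attempts
(DFSClient_attempt_1575479127636_0046_r_000004_0_836874622_1 and
DFSClient_attempt_1575479127636_0046_r_000004_1_1307945884_1) allocate blocks
and close files for the same part-r-00004 paths, which looks like a speculative
second attempt writing over the first. In case that's relevant, speculative
execution could be switched off for a test run. A minimal sketch using the
standard MapReduce properties (this is not our actual job setup, and the class
name is made up):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch: disable speculative execution so that only one attempt per reduce
// task ever writes a given part-r-* file. These are the standard MapReduce
// properties; the surrounding job setup is illustrative only.
public class NoSpeculationJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setBoolean("mapreduce.map.speculative", false);
    conf.setBoolean("mapreduce.reduce.speculative", false);
    Job job = Job.getInstance(conf, "parse (speculation off)");
    // ... configure mapper/reducer, input and output paths as usual ...
    System.out.println("reduce speculation: "
        + job.getConfiguration().getBoolean("mapreduce.reduce.speculative", true));
  }
}
{code}
The same two properties can also be set cluster-wide in mapred-site.xml.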
Let me know if you need more information.
> Segment Part problem with HDFS on distributed mode
> --------------------------------------------------
>
> Key: NUTCH-2756
> URL: https://issues.apache.org/jira/browse/NUTCH-2756
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.15
> Reporter: Lucas Pauchard
> Priority: Major
> Attachments: 0_byte_file_screenshot.PNG, hadoop-env.sh,
> hdfs-site.xml, mapred-site.xml, yarn-env.sh, yarn-site.xml
>
>
> During parsing, it sometimes happens that parts of the data on HDFS are
> missing afterwards.
> When I take a look at our HDFS, I see this file with 0 bytes (see
> attachments).
> After that, the CrawlDB complains about this specific (corrupted?) part:
> {panel:title=log_crawl}
> 2019-12-04 22:25:57,454 INFO mapreduce.Job: Task Id :
> attempt_1575479127636_0047_m_000017_2, Status : FAILED
> Error: java.io.EOFException:
> hdfs://jobmaster:9000/user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00004
> not a SequenceFile
> at
> org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1964)
> at
> org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1923)
> at
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1872)
> at
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1886)
> at
> org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.initialize(SequenceFileRecordReader.java:54)
> at
> org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:560)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:798)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
> {panel}
> When I check the namenode logs, I don't see any error during the writing of
> the segment part, but one hour later I get the following log:
> {panel:title=log_namenode}
> 2019-12-04 23:23:13,750 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease.
> Holder: DFSClient_attempt_1575479127636_0046_r_000004_1_1307945884_1, pending
> creates: 2],
> src=/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00004/index
> 2019-12-04 23:23:13,750 WARN org.apache.hadoop.hdfs.StateChange: BLOCK*
> internalReleaseLease: All existing blocks are COMPLETE, lease removed, file
> /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00004/index
> closed.
> 2019-12-04 23:23:13,750 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease.
> Holder: DFSClient_attempt_1575479127636_0046_r_000004_1_1307945884_1, pending
> creates: 1],
> src=/user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00004
> 2019-12-04 23:23:13,750 WARN org.apache.hadoop.hdfs.StateChange: BLOCK*
> internalReleaseLease: All existing blocks are COMPLETE, lease removed, file
> /user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00004
> closed.
> {panel}
> This issue is hard to reproduce and I can't figure out what the
> preconditions are. It seems to happen randomly.
> Maybe the problem comes from mishandling of how the file is closed.