[ https://issues.apache.org/jira/browse/SQOOP-738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527325#comment-13527325 ]
Jarek Jarcec Cecho commented on SQOOP-738:
------------------------------------------
I was able to capture the following mapper log from a task that delivered an empty file:
{code}
2012-12-08 18:35:28,031 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
2012-12-08 18:35:29,099 WARN org.apache.hadoop.conf.Configuration: session.id is deprecated. Instead, use dfs.metrics.session-id
2012-12-08 18:35:29,101 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2012-12-08 18:35:39,582 INFO org.apache.hadoop.util.ProcessTree: setsid exited with exit code 0
2012-12-08 18:35:39,587 INFO org.apache.hadoop.mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@543bc20e
2012-12-08 18:35:48,655 INFO org.apache.hadoop.mapred.Task: Task:attempt_201212071653_0005_m_000004_0 is done. And is in the process of commiting
2012-12-08 18:35:49,787 INFO org.apache.hadoop.mapred.Task: Task attempt_201212071653_0005_m_000004_0 is allowed to commit now
2012-12-08 18:35:49,858 INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Saved output of task 'attempt_201212071653_0005_m_000004_0' to /user/root/texts
2012-12-08 18:35:49,864 INFO org.apache.hadoop.mapred.Task: Task 'attempt_201212071653_0005_m_000004_0' done.
2012-12-08 18:35:51,445 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2012-12-08 18:35:51,490 ERROR org.apache.sqoop.job.mr.SqoopOutputFormatLoadExecutor: Error while loading data out of MR job.
org.apache.sqoop.common.SqoopException: MAPRED_EXEC_0018:Error occurs during loader run
    at org.apache.sqoop.job.etl.HdfsTextImportLoader.load(HdfsTextImportLoader.java:98)
    at org.apache.sqoop.job.mr.SqoopOutputFormatLoadExecutor$ConsumerThread.run(SqoopOutputFormatLoadExecutor.java:193)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /user/root/texts/_temporary/_attempt_201212071653_0005_m_000004_0/part-m-00004 File does not exist. [Lease. Holder: DFSClient_NONMAPREDUCE_-243096719_1, pendingcreates: 1]
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2308)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2299)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:2366)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2343)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:526)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:335)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44084)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:898)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1693)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1689)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1687)
    at org.apache.hadoop.ipc.Client.call(Client.java:1160)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
    at $Proxy10.complete(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
    at $Proxy10.complete(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.complete(ClientNamenodeProtocolTranslatorPB.java:329)
    at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:1769)
    at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:1756)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:66)
    at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:99)
    at sun.nio.cs.StreamEncoder.implClose(StreamEncoder.java:301)
    at sun.nio.cs.StreamEncoder.close(StreamEncoder.java:130)
    at java.io.OutputStreamWriter.close(OutputStreamWriter.java:216)
    at java.io.BufferedWriter.close(BufferedWriter.java:248)
    at org.apache.sqoop.job.etl.HdfsTextImportLoader.load(HdfsTextImportLoader.java:95)
    ... 7 more
{code}
Please notice that Hadoop committed the task (that is, moved the output file) before the exception was thrown, which suggests that we were still writing output at the time the commit was happening. I believe that because we hand data from the mapper (reducer) to the output format through a separate consumer thread, the mapper can finish before all data are written to disk. When this happens, Hadoop is sometimes fast enough to call the task committer, which moves the output file before we finish writing, thus losing the unflushed data. A minimal sketch of the race follows.
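For illustration only, here is a minimal, hypothetical Java sketch (not actual Sqoop code; the class and method names are made up) of the producer-consumer handoff described above. The buggy close() returns as soon as the end-of-input marker is queued, so the framework may commit a half-written file; the fixed variant joins the consumer thread first.
{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical model of the mapper-to-loader handoff (not Sqoop's real classes).
public class LoaderRace {
    private static final String EOF = "\u0000EOF";  // sentinel marking end of input
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>(1024);
    private final Thread consumer;

    public LoaderRace() {
        // Stand-in for the ConsumerThread seen in the stack trace: it drains
        // the queue and performs the (slow) write to the output file.
        consumer = new Thread(() -> {
            try {
                String record;
                while (!(record = queue.take()).equals(EOF)) {
                    writeToOutput(record);   // slow I/O happens here
                }
                // flush + close of the output stream would happen here, after EOF
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "consumer");
        consumer.start();
    }

    // Called by the mapper for each record.
    public void write(String record) throws InterruptedException {
        queue.put(record);
    }

    // Buggy variant: signals EOF but does not wait for the consumer, so the
    // framework can run the task committer and move the output file while
    // unflushed records are still in flight.
    public void closeBuggy() throws InterruptedException {
        queue.put(EOF);
        // missing: consumer.join();
    }

    // Fixed variant: block until the consumer has drained the queue and
    // closed the output stream; only then may the task commit proceed.
    public void closeFixed() throws InterruptedException {
        queue.put(EOF);
        consumer.join();                     // wait for all data to hit disk
    }

    private void writeToOutput(String record) {
        System.out.println(record);          // stand-in for the real HDFS write
    }
}
{code}
Presumably the fix is along the closeFixed() lines: make the mapper wait for the consumer thread in SqoopOutputFormatLoadExecutor to finish before returning, so that FileOutputCommitter only moves the file after the loader has closed it.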
> Sqoop is not importing all data in Sqoop 2
> ------------------------------------------
>
> Key: SQOOP-738
> URL: https://issues.apache.org/jira/browse/SQOOP-738
> Project: Sqoop
> Issue Type: Bug
> Affects Versions: 2.0.0
> Reporter: Jarek Jarcec Cecho
> Assignee: Jarek Jarcec Cecho
> Priority: Blocker
> Fix For: 1.99.1
>
>
> I've tried to import exactly 408,957 rows (nice round number, right?) using 10
> mappers and I've noticed that not all mappers deliver all of their data every
> time. For example, in one run I got 6 files of the expected 10MB size while
> the other 4 random files were completely empty. In another run I got 8 files
> of 10MB and just 2 empty files. I could not find any logic in how many and
> which files end up empty. We definitely need to address this.