Victor Ferrer created ZEPPELIN-2496:
---------------------------------------
Summary: Error listing a HDFS directory with a large number of files
Key: ZEPPELIN-2496
URL: https://issues.apache.org/jira/browse/ZEPPELIN-2496
Project: Zeppelin
Issue Type: Bug
Components: Interpreters
Affects Versions: 0.7.1
Environment: Centos 7 - CDH 5
Reporter: Victor Ferrer
Hi,
I have noticed an incorrect behavior while using the HDFS (%file) interpreter.
For instance, when I list this directory, I get the correct result:
{noformat}
%file
ls -l /mediation2/kpis/ran/cell3g/temporalAgg=ROP/year=2017/month=5/day=1
{noformat}
{noformat}
-rw-r--r-- 3 hdfs supergroup 89376267 2017-05-03 12:29GMT /mediation2/kpis/ran/cell3g/temporalAgg=ROP/year=2017/month=5/day=1/part-r-00000-ac4e728a-332e-40ca-b42b-daaf783fe227.snappy.parquet
-rw-r--r-- 3 hdfs supergroup 88585675 2017-05-03 12:29GMT /mediation2/kpis/ran/cell3g/temporalAgg=ROP/year=2017/month=5/day=1/part-r-00001-ac4e728a-332e-40ca-b42b-daaf783fe227.snappy.parquet
{noformat}
However, when I switch to a bigger directory, I get an error stating that the
directory could not be found:
{noformat}
%file
ls -l /mediation2/kpis/ran/cell3g/temporalAgg=ROP/year=2017/month=5/day=2
{noformat}
{noformat}
Could not find file or directory: /mediation2/kpis/ran/cell3g/temporalAgg=ROP/year=2017/month=5/day=2
{noformat}
Digging into the logs, I find this error message:
{noformat}
ERROR [2017-05-04 12:05:44,910] ({pool-2-thread-14} HDFSFileInterpreter.java[listAll]:227) - listall: listDir /mediation2/kpis/ran/cell3g/temporalAgg=ROP/year=2017/month=5/day=2
com.google.gson.JsonSyntaxException: java.io.EOFException: End of input at line 1 column 311752
    at com.google.gson.Gson.fromJson(Gson.java:800)
    at com.google.gson.Gson.fromJson(Gson.java:757)
    at com.google.gson.Gson.fromJson(Gson.java:706)
    at com.google.gson.Gson.fromJson(Gson.java:678)
    at org.apache.zeppelin.file.HDFSFileInterpreter.listAll(HDFSFileInterpreter.java:212)
    at org.apache.zeppelin.file.FileInterpreter.interpret(FileInterpreter.java:130)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:95)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:490)
    at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
    at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException: End of input at line 1 column 311752
    at com.google.gson.stream.JsonReader.nextNonWhitespace(JsonReader.java:954)
    at com.google.gson.stream.JsonReader.nextInArray(JsonReader.java:677)
    at com.google.gson.stream.JsonReader.peek(JsonReader.java:376)
    at com.google.gson.stream.JsonReader.hasNext(JsonReader.java:349)
    at com.google.gson.internal.bind.ArrayTypeAdapter.read(ArrayTypeAdapter.java:71)
    at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$1.read(ReflectiveTypeAdapterFactory.java:93)
    at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.read(ReflectiveTypeAdapterFactory.java:172)
    at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$1.read(ReflectiveTypeAdapterFactory.java:93)
    at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.read(ReflectiveTypeAdapterFactory.java:172)
    at com.google.gson.Gson.fromJson(Gson.java:791)
    ... 16 more
ERROR [2017-05-04 12:05:44,911] ({pool-2-thread-14} FileInterpreter.java[interpret]:133) - Error listing files in path /mediation2/kpis/ran/cell3g/temporalAgg=ROP/year=2017/month=5/day=2
org.apache.zeppelin.interpreter.InterpreterException: Could not find file or directory: /mediation2/kpis/ran/cell3g/temporalAgg=ROP/year=2017/month=5/day=2
    at org.apache.zeppelin.file.HDFSFileInterpreter.listAll(HDFSFileInterpreter.java:228)
    at org.apache.zeppelin.file.FileInterpreter.interpret(FileInterpreter.java:130)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:95)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:490)
    at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
    at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
{noformat}
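I understand that the directory listing might be too large for the underlying REST
interface (the /webhdfs interface), but the current message is misleading: the
directory exists, and it is the JSON response that ends prematurely. A more
graceful message could be returned, or perhaps some partial content.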
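As a rough illustration of what a more graceful message might look like, here is a minimal sketch that checks the exception's cause chain for an EOFException (as seen in the log above) before falling back to the "Could not find" message. The class name ListingErrors, both method names, and the wording of the new message are hypothetical and not part of Zeppelin; the sketch only assumes that the parse failure reaches the handler with its cause chain intact.

```java
import java.io.EOFException;

// Hypothetical helper, not Zeppelin code: picks an error message based on
// what actually went wrong while listing a directory.
public class ListingErrors {

    // True if any exception in the cause chain is an EOFException, i.e. the
    // response ended before the JSON document was complete.
    public static boolean isTruncatedResponse(Throwable failure) {
        for (Throwable t = failure; t != null; t = t.getCause()) {
            if (t instanceof EOFException) {
                return true;
            }
        }
        return false;
    }

    // Builds the user-facing message for a failed listing of `path`.
    public static String describe(String path, Throwable failure) {
        if (isTruncatedResponse(failure)) {
            // Assumed wording; the point is to not claim the path is missing.
            return "Listing of " + path + " was cut short before it could be parsed;"
                + " the directory may be too large for a single REST response";
        }
        return "Could not find file or directory: " + path;
    }

    public static void main(String[] args) {
        // Stand-in for the JsonSyntaxException wrapping an EOFException.
        Throwable parseFailure = new RuntimeException(
            new EOFException("End of input at line 1 column 311752"));
        System.out.println(describe("/example/day=2", parseFailure));
        System.out.println(describe("/example/missing", new RuntimeException("not found")));
    }
}
```

With something along these lines, a truncated listing and a genuinely missing directory would no longer produce the same message.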
Cheers,
Victor