Victor Ferrer created ZEPPELIN-2496:
---------------------------------------

             Summary: Error listing a HDFS directory with a large number of files
                 Key: ZEPPELIN-2496
                 URL: https://issues.apache.org/jira/browse/ZEPPELIN-2496
             Project: Zeppelin
          Issue Type: Bug
          Components: Interpreters
    Affects Versions: 0.7.1
         Environment: Centos 7 - CDH 5
            Reporter: Victor Ferrer


Hi,

I have noticed an incorrect behavior while using the HDFS (%file) interpreter.

For instance, when I list this directory, I get the correct result:

{noformat}
%file
ls -l /mediation2/kpis/ran/cell3g/temporalAgg=ROP/year=2017/month=5/day=1
{noformat}

{noformat}
-rw-r--r-- 3 hdfs supergroup 89376267 2017-05-03 12:29GMT /mediation2/kpis/ran/cell3g/temporalAgg=ROP/year=2017/month=5/day=1/part-r-00000-ac4e728a-332e-40ca-b42b-daaf783fe227.snappy.parquet
-rw-r--r-- 3 hdfs supergroup 88585675 2017-05-03 12:29GMT /mediation2/kpis/ran/cell3g/temporalAgg=ROP/year=2017/month=5/day=1/part-r-00001-ac4e728a-332e-40ca-b42b-daaf783fe227.snappy.parquet
{noformat}

However, when I switch to a bigger directory, I get an error stating that the directory could not be found:

{noformat}
%file
ls -l /mediation2/kpis/ran/cell3g/temporalAgg=ROP/year=2017/month=5/day=2
{noformat}

{noformat}
Could not find file or directory: /mediation2/kpis/ran/cell3g/temporalAgg=ROP/year=2017/month=5/day=2
{noformat}

If I dig in the logs, I get this error message:

{noformat}
ERROR [2017-05-04 12:05:44,910] ({pool-2-thread-14} HDFSFileInterpreter.java[listAll]:227) - listall: listDir /mediation2/kpis/ran/cell3g/temporalAgg=ROP/year=2017/month=5/day=2
com.google.gson.JsonSyntaxException: java.io.EOFException: End of input at line 1 column 311752
    at com.google.gson.Gson.fromJson(Gson.java:800)
    at com.google.gson.Gson.fromJson(Gson.java:757)
    at com.google.gson.Gson.fromJson(Gson.java:706)
    at com.google.gson.Gson.fromJson(Gson.java:678)
    at org.apache.zeppelin.file.HDFSFileInterpreter.listAll(HDFSFileInterpreter.java:212)
    at org.apache.zeppelin.file.FileInterpreter.interpret(FileInterpreter.java:130)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:95)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:490)
    at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
    at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException: End of input at line 1 column 311752
    at com.google.gson.stream.JsonReader.nextNonWhitespace(JsonReader.java:954)
    at com.google.gson.stream.JsonReader.nextInArray(JsonReader.java:677)
    at com.google.gson.stream.JsonReader.peek(JsonReader.java:376)
    at com.google.gson.stream.JsonReader.hasNext(JsonReader.java:349)
    at com.google.gson.internal.bind.ArrayTypeAdapter.read(ArrayTypeAdapter.java:71)
    at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$1.read(ReflectiveTypeAdapterFactory.java:93)
    at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.read(ReflectiveTypeAdapterFactory.java:172)
    at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$1.read(ReflectiveTypeAdapterFactory.java:93)
    at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.read(ReflectiveTypeAdapterFactory.java:172)
    at com.google.gson.Gson.fromJson(Gson.java:791)
    ... 16 more
ERROR [2017-05-04 12:05:44,911] ({pool-2-thread-14} FileInterpreter.java[interpret]:133) - Error listing files in path /mediation2/kpis/ran/cell3g/temporalAgg=ROP/year=2017/month=5/day=2
org.apache.zeppelin.interpreter.InterpreterException: Could not find file or directory: /mediation2/kpis/ran/cell3g/temporalAgg=ROP/year=2017/month=5/day=2
    at org.apache.zeppelin.file.HDFSFileInterpreter.listAll(HDFSFileInterpreter.java:228)
    at org.apache.zeppelin.file.FileInterpreter.interpret(FileInterpreter.java:130)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:95)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:490)
    at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
    at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
{noformat}

I understand that the directory might be too big for the underlying REST interface (the /webhdfs interface) but perhaps a more graceful message could be returned, or perhaps some partial content, etc.

Cheers,
Victor



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
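
To illustrate the "more graceful message" suggested in the report: the logged EOFException indicates that the JSON listing ended before it was complete, which is a different condition from a missing path. The following is only a minimal sketch, assuming the raw response body is available as a string. The class ListingCheck and its methods looksTruncated and describeFailure are hypothetical names and are not part of Zeppelin or Gson.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper for illustration only; not part of Zeppelin. */
final class ListingCheck {

  /**
   * Returns true if the text ends while a JSON string, object or array is
   * still open. This only tracks unclosed delimiters; it is not a validator.
   */
  static boolean looksTruncated(String json) {
    Deque<Character> open = new ArrayDeque<>();
    boolean inString = false;
    boolean escaped = false;
    for (int i = 0; i < json.length(); i++) {
      char c = json.charAt(i);
      if (inString) {
        // Inside a string literal, braces and brackets are plain characters.
        if (escaped) {
          escaped = false;
        } else if (c == '\\') {
          escaped = true;
        } else if (c == '"') {
          inString = false;
        }
        continue;
      }
      switch (c) {
        case '"':
          inString = true;
          break;
        case '{':
        case '[':
          open.push(c);
          break;
        case '}':
        case ']':
          if (!open.isEmpty()) {
            open.pop();
          }
          break;
        default:
          break;
      }
    }
    return inString || !open.isEmpty();
  }

  /** Builds a message that separates a cut-off listing from other failures. */
  static String describeFailure(String path, String responseBody) {
    if (responseBody == null || responseBody.isEmpty()) {
      return "Empty response while listing " + path;
    }
    if (looksTruncated(responseBody)) {
      return "Listing of " + path + " was cut off after "
          + responseBody.length()
          + " characters; the directory may be too large to list in one request";
    }
    return "Could not parse the listing of " + path;
  }
}
```

With a check like this, the catch block that currently reports "Could not find file or directory" could return the truncation message instead whenever the body was cut off, leaving the "not found" wording for responses that really indicate a missing path.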