[
https://issues.apache.org/jira/browse/DRILL-6953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16990713#comment-16990713
]
ASF GitHub Bot commented on DRILL-6953:
---------------------------------------
cgivre commented on pull request #1913: DRILL-6953: EVF-based version of the
JSON reader
URL: https://github.com/apache/drill/pull/1913#discussion_r355159516
##########
File path:
exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json/JsonBatchReader.java
##########
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.easy.json;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.List;
+
+import org.apache.drill.common.exceptions.UserException;
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.exec.ExecConstants;
+import org.apache.drill.exec.ops.OperatorContext;
+import org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileSchemaNegotiator;
+import org.apache.drill.exec.physical.impl.scan.framework.ManagedReader;
+import org.apache.drill.exec.physical.resultSet.ResultSetLoader;
+import org.apache.drill.exec.physical.resultSet.ResultVectorCache;
+import org.apache.drill.exec.physical.resultSet.RowSetLoader;
+import org.apache.drill.exec.server.options.OptionSet;
+import org.apache.drill.exec.store.dfs.DrillFileSystem;
+import org.apache.drill.exec.store.easy.json.parser.JsonLoaderImpl;
+import org.apache.drill.exec.store.easy.json.parser.JsonLoaderImpl.JsonOptions;
+import org.apache.drill.exec.store.easy.json.parser.JsonLoaderImpl.TypeNegotiator;
+import org.apache.hadoop.mapred.FileSplit;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class JsonBatchReader implements ManagedReader<FileSchemaNegotiator> {
+
+ private static final Logger logger = LoggerFactory.getLogger(JsonBatchReader.class);
+
+ private DrillFileSystem fileSystem;
+ private FileSplit split;
+ private final JsonOptions options;
+ private JsonLoader jsonLoader;
+ private InputStream stream;
+
+ private RowSetLoader tableLoader;
+
+ public JsonBatchReader(JsonOptions options) {
+ this.options = options == null ? new JsonOptions() : options;
+ }
+
+ @Override
+ public boolean open(FileSchemaNegotiator negotiator) {
+ this.fileSystem = negotiator.fileSystem();
+ this.split = negotiator.split();
+ OperatorContext opContext = negotiator.context();
+ OptionSet optionMgr = opContext.getFragmentContext().getOptions();
+ Object embeddedContent = null;
+ options.allTextMode = embeddedContent == null && optionMgr.getBoolean(ExecConstants.JSON_ALL_TEXT_MODE);
+ options.readNumbersAsDouble = embeddedContent == null && optionMgr.getBoolean(ExecConstants.JSON_READ_NUMBERS_AS_DOUBLE);
+ options.unionEnabled = embeddedContent == null && optionMgr.getBoolean(ExecConstants.ENABLE_UNION_TYPE_KEY);
+ options.skipMalformedRecords = optionMgr.getBoolean(ExecConstants.JSON_READER_SKIP_INVALID_RECORDS_FLAG);
+ // Printing of malformed records is always enabled.
+// options.printSkippedMalformedJSONRecordLineNumber = optionMgr.getBoolean(ExecConstants.JSON_READER_PRINT_INVALID_RECORDS_LINE_NOS_FLAG);
+ options.allowNanInf = true;
+
+ try {
+ stream = fileSystem.openPossiblyCompressedStream(split.getPath());
+ } catch (IOException e) {
+ throw UserException
+ .dataReadError(e)
+ .addContext("Failure to open JSON file", split.getPath().toString())
+ .build(logger);
+ }
+ final ResultSetLoader rsLoader = negotiator.build();
+ tableLoader = rsLoader.writer();
+ RowSetLoader rootWriter = tableLoader;
+
+ // Bind the type negotiator that will resolve ambiguous types
+ // using information from any previous readers in this scan.
+
+ options.typeNegotiator = new TypeNegotiator() {
+ @Override
+ public MajorType typeOf(List<String> path) {
+ ResultVectorCache cache = rsLoader.vectorCache();
+ for (int i = 0; i < path.size() - 1; i++) {
+ cache = cache.childCache(path.get(i));
+ }
+ return cache.getType(path.get(path.size()-1));
+ }
+ };
+
+ // Create the JSON loader (high-level parser).
+
+ jsonLoader = new JsonLoaderImpl(stream, rootWriter, options);
Review comment:
@paul-rogers
Would it be possible for you to add some logic here so that
`JsonBatchReader` can accept an `InputStream` as well as file input? This could
be useful for other storage plugins that output JSON.
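One way to support both inputs is to abstract the byte source behind a small functional interface, so the reader no longer opens the file itself. The sketch below is a hedged illustration of that idea only; `StreamSource` and `ReaderSketch` are hypothetical names, not Drill's actual API:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class StreamSourceSketch {

  /** Hypothetical abstraction over "where the JSON bytes come from". */
  @FunctionalInterface
  interface StreamSource {
    InputStream open() throws IOException;
  }

  /** Simplified stand-in for a batch reader: it only consumes the
      stream it is handed; the real reader would parse JSON from it. */
  static class ReaderSketch {
    private final StreamSource source;

    // File-based callers wrap their file-open call in a StreamSource;
    // storage plugins pass a lambda around an InputStream they already hold.
    ReaderSketch(StreamSource source) {
      this.source = source;
    }

    String readAll() throws IOException {
      try (InputStream in = source.open()) {
        return new String(in.readAllBytes());
      }
    }
  }

  public static void main(String[] args) throws IOException {
    // A plugin that already has JSON bytes in memory, no file involved:
    byte[] json = "{\"a\": 1}".getBytes();
    ReaderSketch reader = new ReaderSketch(() -> new ByteArrayInputStream(json));
    System.out.println(reader.readAll());
  }
}
```

With this shape, the existing file path becomes just one `StreamSource` implementation, so no second constructor is strictly required.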
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
> Merge row set-based JSON reader
> -------------------------------
>
> Key: DRILL-6953
> URL: https://issues.apache.org/jira/browse/DRILL-6953
> Project: Apache Drill
> Issue Type: Improvement
> Affects Versions: 1.15.0
> Reporter: Paul Rogers
> Assignee: Paul Rogers
> Priority: Major
> Labels: doc-impacting
> Fix For: 1.18.0
>
>
> The final step in the ongoing "result set loader" saga is to merge the
> revised JSON reader into master. This reader does three key things:
> * Demonstrates the prototypical "late schema" style of data reading (discover
> schema while reading).
> * Implements many tricks and hacks to handle schema changes while loading.
> * Shows that, even with all these tricks, the only true solution is to
> actually have a schema.
> The new JSON reader:
> * Uses an expanded state machine when parsing rather than the complex set of
> if-statements in the current version.
> * Handles reading a run of nulls before seeing the first data value (as long
> as the data value shows up in the first record batch).
> * Uses the result-set loader to generate fixed-size batches regardless of the
> complexity, depth of structure, or width of variable-length fields.
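The "run of nulls" bullet above can be sketched as deferred type resolution: count leading nulls, then back-fill once the first concrete value reveals the type. This is an illustrative Java sketch under stated assumptions, not the reader's actual implementation; `ColumnState` and the two-type mapping are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

public class DeferredTypeSketch {

  /** Hypothetical column state: the type is unknown until the first
      non-null value arrives; leading nulls are counted and back-filled
      once the type is known. */
  static class ColumnState {
    String resolvedType;          // null until the first real value
    int pendingNulls;             // nulls seen before type resolution
    final List<Object> values = new ArrayList<>();

    void writeNull() {
      if (resolvedType == null) {
        pendingNulls++;           // defer: the vector type is still unknown
      } else {
        values.add(null);
      }
    }

    void writeValue(Object v) {
      if (resolvedType == null) {
        // First concrete value resolves the type; back-fill the nulls.
        resolvedType = v instanceof Number ? "DOUBLE" : "VARCHAR";
        for (int i = 0; i < pendingNulls; i++) {
          values.add(null);
        }
        pendingNulls = 0;
      }
      values.add(v);
    }
  }

  public static void main(String[] args) {
    ColumnState col = new ColumnState();
    col.writeNull();
    col.writeNull();
    col.writeValue(3.5);          // type resolves within the first batch
    System.out.println(col.resolvedType + " " + col.values.size());
  }
}
```

As the bullet notes, this trick only works while the first data value arrives within the first record batch; after that the vector type is already committed.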
> While the JSON reader itself is helpful, the key contribution is that it
> shows how to use the entire kit of parts: result set loader, projection
> framework, and so on. Since the projection framework can handle an external
> schema, it is also a handy foundation for the ongoing schema project.
> Key work to complete after this merger will be to reconcile actual data with
> the external schema. For example, if we know a column is supposed to be a
> VarChar, then read the column as a VarChar regardless of the type JSON itself
> picks. Or, if a column is supposed to be a Double, then convert Int and
> String JSON values into Doubles.
> The Row Set framework was designed to allow inserting custom column writers.
> This would be a great opportunity to do the work needed to create them. Then,
> use the new JSON framework to allow parsing a JSON field as a specified Drill
> type.
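The schema-reconciliation idea above (read a column as its declared type regardless of what JSON supplies) can be sketched as a converting column writer. The names below are hypothetical illustrations, not the Row Set framework's actual writer API:

```java
public class ConvertingWriterSketch {

  /** Hypothetical scalar writer interface. */
  interface ScalarWriter {
    void setObject(Object value);
  }

  /** Hypothetical writer that always produces doubles, converting
      whatever scalar type JSON happened to supply for the column. */
  static class DoubleConvertingWriter implements ScalarWriter {
    double last;

    @Override
    public void setObject(Object value) {
      // Accept Int and String JSON values, as the issue suggests,
      // and normalize them to the schema-declared Double.
      if (value instanceof Number) {
        last = ((Number) value).doubleValue();
      } else if (value instanceof String) {
        last = Double.parseDouble((String) value);
      } else {
        throw new IllegalArgumentException("Cannot convert to Double: " + value);
      }
    }
  }

  public static void main(String[] args) {
    DoubleConvertingWriter w = new DoubleConvertingWriter();
    w.setObject(42);        // JSON supplied an Int
    w.setObject("3.25");    // JSON supplied a String
    System.out.println(w.last);
  }
}
```

Plugging such a writer in at vector-writing time would let the parser stay type-agnostic while the schema dictates the stored Drill type.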
--
This message was sent by Atlassian Jira
(v8.3.4#803005)