vvysotskyi commented on a change in pull request #2023: DRILL-7640: EVF-based JSON Loader URL: https://github.com/apache/drill/pull/2023#discussion_r392392054
##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json/loader/JsonLoaderImpl.java
##########
@@ -0,0 +1,341 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.easy.json.loader;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.ArrayList;
+import java.util.List;
+
+import org.apache.drill.common.exceptions.CustomErrorContext;
+import org.apache.drill.common.exceptions.UserException;
+import org.apache.drill.exec.physical.resultSet.ResultSetLoader;
+import org.apache.drill.exec.physical.resultSet.RowSetLoader;
+import org.apache.drill.exec.record.metadata.ColumnMetadata;
+import org.apache.drill.exec.record.metadata.TupleMetadata;
+import org.apache.drill.exec.store.easy.json.parser.ErrorFactory;
+import org.apache.drill.exec.store.easy.json.parser.JsonStructureParser;
+import org.apache.drill.exec.store.easy.json.parser.ValueDef;
+import org.apache.drill.exec.store.easy.json.parser.ValueDef.JsonType;
+import org.apache.drill.exec.vector.accessor.UnsupportedConversionError;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import com.esri.core.geometry.JsonReader;
+import com.fasterxml.jackson.core.JsonParseException;
+import com.fasterxml.jackson.core.JsonToken;
+
+/**
+ * Revised JSON loader that is based on the
+ * {@link ResultSetLoader} abstraction. Uses the listener-based
+ * {@link JsonStructureParser} to walk the JSON tree in a "streaming"
+ * fashion, calling events which this class turns into vector write
+ * operations. Listeners handle options such as all text mode
+ * vs. type-specific parsing. Think of this implementation as a
+ * listener-based recursive-descent parser.
+ * <p>
+ * The JSON loader mechanism runs two state machines intertwined:
+ * <ol>
+ * <li>The actual parser (to parse each JSON object, array or scalar according
+ * to its inferred type represented by the {@code JsonStructureParser}.</li>
+ * <li>The type discovery machine, which is made complex because JSON may include
+ * long runs of nulls, represented by this class.</li>
+ * </ol>
+ *
+ * <h4>Schema Discovery</h4>
+ *
+ * Fields are discovered on the fly. Types are inferred from the first JSON token
+ * for a field. Type inference is less than perfect: it cannot handle type changes
+ * such as first seeing 10, then 12.5, or first seeing "100", then 200.
+ * <p>
+ * When a field first contains null or an empty list, "null deferral" logic
+ * adds a special state that "waits" for an actual data type to present itself.
+ * This allows the parser to handle a series of nulls, empty arrays, or arrays
+ * of nulls (when using lists) at the start of the file. If no type ever appears,
+ * the loader forces the field to "text mode", hoping that the field is scalar.
+ * <p>
+ * To slightly help the null case, if the projection list shows that a column
+ * must be an array or a map, then that information is used to guess the type
+ * of a null column.
+ * <p>
+ * The code includes a prototype mechanism to provide type hints for columns.
+ * At present, it is just used to handle nulls that are never "resolved" by the
+ * end of a batch. Would be much better to use the hints (or a full schema) to
+ * avoid the huge mass of code needed to handle nulls.
+ *
+ * <h4>Provided Schema</h4>
+ *
+ * The JSON loader accepts a provided schema which removes type ambiguities.
+ * If we have the examples above (runs of nulls, or shifting types), then the
+ * provided schema says the vector type to create; the individual column listeners
+ * attempt to convert the JSON token type to the target vector type. The result
+ * is that, if the schema provides the correct type, the loader can ride over
+ * ambiguities in the input.
+ *
+ * <h4>Comparison to Original JSON Reader</h4>
+ *
+ * This class replaces the {@link JsonReader} class used in Drill versions 1.12

Review comment:
Nit:
```suggestion
 * This class replaces the {@link JsonReader} class used in Drill versions 1.17
```
