[ https://issues.apache.org/jira/browse/SPARK-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077115#comment-14077115 ]

Robbie Russo commented on SPARK-1649:
-------------------------------------

Thrift also supports null values in maps, which makes any Thrift-generated 
Parquet file that contains a map unreadable by Spark SQL, because of the 
following code in parquet-thrift that generates the schema for maps:

{code:title=parquet.thrift.ThriftSchemaConverter.java|borderStyle=solid}
  @Override
  public void visit(ThriftType.MapType mapType) {
    final ThriftField mapKeyField = mapType.getKey();
    final ThriftField mapValueField = mapType.getValue();

    //save env for map
    String mapName = currentName;
    Type.Repetition mapRepetition = currentRepetition;

    //=========handle key
    currentFieldPath.push(mapKeyField);
    currentName = "key";
    currentRepetition = REQUIRED;
    mapKeyField.getType().accept(this);
    Type keyType = currentType;//currentType is the already converted type
    currentFieldPath.pop();

    //=========handle value
    currentFieldPath.push(mapValueField);
    currentName = "value";
    currentRepetition = OPTIONAL;
    mapValueField.getType().accept(this);
    Type valueType = currentType;
    currentFieldPath.pop();

    if (keyType == null && valueType == null) {
      currentType = null;
      return;
    }

    if (keyType == null && valueType != null)
      throw new ThriftProjectionException("key of map is not specified in projection: " + currentFieldPath);

    //restore Env
    currentName = mapName;
    currentRepetition = mapRepetition;
    currentType = ConversionPatterns.mapType(currentRepetition, currentName,
            keyType,
            valueType);
  }
{code}
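
For a Thrift field such as map<string,i32>, the schema this produces looks 
roughly like the following (an illustration with a made-up field name, not 
verbatim converter output); note the OPTIONAL value field:

{code}
optional group my_map (MAP) {
  repeated group map (MAP_KEY_VALUE) {
    required binary key (UTF8);
    optional int32 value;
  }
}
{code}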

This causes an error on the Spark side when we reach the following step in 
the toDataType function, which asserts that both the key and the value have 
repetition level REQUIRED:

{code:title=org.apache.spark.sql.parquet.ParquetTypes.scala|borderStyle=solid}
        case ParquetOriginalType.MAP => {
          assert(
            !groupType.getFields.apply(0).isPrimitive,
            "Parquet Map type malformatted: expected nested group for map!")
          val keyValueGroup = groupType.getFields.apply(0).asGroupType()
          assert(
            keyValueGroup.getFieldCount == 2,
            "Parquet Map type malformatted: nested group should have 2 (key, 
value) fields!")
          val keyType = toDataType(keyValueGroup.getFields.apply(0))
          println("here")
          assert(keyValueGroup.getFields.apply(0).getRepetition == 
Repetition.REQUIRED)
          val valueType = toDataType(keyValueGroup.getFields.apply(1))
          assert(keyValueGroup.getFields.apply(1).getRepetition == Repetition.REQUIRED)
          new MapType(keyType, valueType)
        }
{code}
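
One way Spark SQL could accept such files would be to treat an OPTIONAL value 
field as a map whose values may be null, instead of asserting. A rough sketch, 
assuming MapType gains a valueContainsNull flag (which is what this issue 
proposes, not what Spark currently has):

{code}
        case ParquetOriginalType.MAP => {
          assert(
            !groupType.getFields.apply(0).isPrimitive,
            "Parquet Map type malformatted: expected nested group for map!")
          val keyValueGroup = groupType.getFields.apply(0).asGroupType()
          assert(
            keyValueGroup.getFieldCount == 2,
            "Parquet Map type malformatted: nested group should have 2 (key, value) fields!")
          // keys must still be REQUIRED: a null map key is not meaningful
          assert(keyValueGroup.getFields.apply(0).getRepetition == Repetition.REQUIRED)
          val keyType = toDataType(keyValueGroup.getFields.apply(0))
          // an OPTIONAL value only means the map may contain nulls,
          // so record that instead of failing
          val valueField = keyValueGroup.getFields.apply(1)
          val valueType = toDataType(valueField)
          new MapType(keyType, valueType,
            valueContainsNull = valueField.getRepetition != Repetition.REQUIRED)
        }
{code}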

Currently I have modified parquet-thrift to use repetition REQUIRED for map 
values, just to make Spark SQL able to read the Parquet files; that is safe 
for us only because we never actually store null values in our maps. However, 
it would be preferable for parquet-thrift and Spark SQL to work together out 
of the box with our existing Thrift data types, without having to modify 
dependencies. The local patch is sketched below.
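
For reference, the workaround is a one-line change in the value branch of the 
visitor quoted above (a sketch of what I changed, not a proper fix, since it 
trades away null support in map values):

{code}
    //=========handle value
    currentFieldPath.push(mapValueField);
    currentName = "value";
    // workaround: declare map values REQUIRED so Spark SQL's assertions pass;
    // only safe because our data never contains null map values
    currentRepetition = REQUIRED;  // was: OPTIONAL
{code}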

> Figure out Nullability semantics for Array elements and Map values
> ------------------------------------------------------------------
>
>                 Key: SPARK-1649
>                 URL: https://issues.apache.org/jira/browse/SPARK-1649
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.1.0
>            Reporter: Andre Schumacher
>            Priority: Critical
>
> For the underlying storage layer, it would simplify things such as schema 
> conversions and predicate filter determination to record in the data type 
> itself whether a column can be nullable. So the DataType type could look 
> like this:
> abstract class DataType(nullable: Boolean = true)
> Concrete subclasses could then override the nullable val. Mostly this could 
> be left at the default, but when types are contained in nested types one 
> could optimize for, e.g., arrays whose elements are nullable versus those 
> whose elements are not.
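
For illustration, the proposal in the quoted description could look roughly 
like this (a sketch; the subclasses and flag names here are hypothetical, not 
Spark's actual API):

{code}
// sketch of a nullability-carrying DataType; names and defaults illustrative
abstract class DataType(val nullable: Boolean = true)

// nested types can record the nullability of what they contain
case class ArrayType(elementType: DataType, containsNull: Boolean = true)
  extends DataType()

// only map values need to be nullable; keys stay non-null
case class MapType(keyType: DataType, valueType: DataType,
    valueContainsNull: Boolean = true) extends DataType()
{code}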


