Julien Genini created SPARK-10869:
-------------------------------------
Summary: Auto-normalization of semi-structured schema from a dataframe
Key: SPARK-10869
URL: https://issues.apache.org/jira/browse/SPARK-10869
Project: Spark
Issue Type: New Feature
Components: PySpark
Affects Versions: 1.5.1
Reporter: Julien Genini
Priority: Minor
Today, you can get a multi-level (nested) schema from a semi-structured
DataFrame (XML, JSON, etc.).
A nested schema is not easy to deal with in data warehousing, where it is
better to normalize the data.
I propose adding an option when you get the schema (linear, default False)
that returns the full path of each field, plus the list of the different
node levels:
df = sqlContext.read.json(jsonPath)
jsonLinearSchema = df.schema.jsonValue(linear=True)
>>
{'fields': [{'metadata': {},
             'name': 'BusinessDate',
             'nullable': True,
             'pathName': 'SiteXML.BusinessDate',
             'type': 'string'},
            {'metadata': {},
             'name': 'Id_Group',
             'nullable': True,
             'pathName': 'SiteXML.Site_List.Site.Id_Group',
             'type': 'string'},
            {'metadata': {},
             'name': 'Id_Site',
             'nullable': True,
             'pathName': 'SiteXML.Site_List.Site.Id_Site',
             'type': 'string'},
            {'metadata': {},
             'name': 'label',
             'nullable': True,
             'pathName': 'SiteXML.Site_List.Site.label',
             'type': 'string'},
            {'metadata': {},
             'name': 'label_group',
             'nullable': True,
             'pathName': 'SiteXML.Site_List.Site.label_group',
             'type': 'string'},
            {'metadata': {},
             'name': 'TimeStamp',
             'nullable': True,
             'pathName': 'SiteXML.TimeStamp',
             'type': 'string'}],
 'nodes': [{'name': '', 'nbFields': 3},
           {'name': 'SiteXML', 'nbFields': 1},
           {'name': 'SiteXML.Site_List', 'nbFields': 0},
           {'name': 'SiteXML.Site_List.Site', 'nbFields': 4}]}
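
Until such an option exists, roughly the same result can be obtained outside
Spark by walking the dict that df.schema.jsonValue() already returns. The
following is only a sketch of the idea, not the proposed implementation:
linearize_schema and the field helper are hypothetical names, the sample
schema is hand-built to resemble the one above (with Site assumed to be an
array of structs), and nbFields is assumed here to mean "leaf fields directly
under this node", so the counts may not match the listing above.

```python
def linearize_schema(schema_json):
    """Flatten a nested schema dict (the layout produced by
    StructType.jsonValue()) into leaf fields carrying a dotted pathName,
    plus the list of struct nodes that were visited."""
    fields = []
    nodes = []

    def walk(struct, prefix):
        # One node entry per struct level; '' stands for the root.
        node = {'name': prefix, 'nbFields': 0}
        nodes.append(node)
        for f in struct['fields']:
            path = prefix + '.' + f['name'] if prefix else f['name']
            dtype = f['type']
            # Look through arrays so that array<struct<...>> is flattened too.
            while isinstance(dtype, dict) and dtype.get('type') == 'array':
                dtype = dtype['elementType']
            if isinstance(dtype, dict) and dtype.get('type') == 'struct':
                walk(dtype, path)
            else:
                # Leaf field (other complex types such as maps are kept as-is).
                node['nbFields'] += 1
                fields.append({'metadata': f.get('metadata', {}),
                               'name': f['name'],
                               'nullable': f['nullable'],
                               'pathName': path,
                               'type': dtype})

    walk(schema_json, '')
    return {'fields': fields, 'nodes': nodes}


def field(name, dtype):
    # Hypothetical helper: builds one field entry in jsonValue() layout.
    return {'name': name, 'type': dtype, 'nullable': True, 'metadata': {}}


# Hand-built stand-in for df.schema.jsonValue() on the sample document.
site = {'type': 'struct',
        'fields': [field('Id_Group', 'string'), field('Id_Site', 'string'),
                   field('label', 'string'), field('label_group', 'string')]}
site_list = {'type': 'struct',
             'fields': [field('Site', {'type': 'array', 'elementType': site,
                                       'containsNull': True})]}
site_xml = {'type': 'struct',
            'fields': [field('BusinessDate', 'string'),
                       field('Site_List', site_list),
                       field('TimeStamp', 'string')]}
nested = {'type': 'struct', 'fields': [field('SiteXML', site_xml)]}

linear = linearize_schema(nested)
for f in linear['fields']:
    print(f['pathName'], f['type'])
```

With a real DataFrame, the input would be df.schema.jsonValue() instead of
the hand-built dict; the pathName values could then drive the column
selection for each normalized table.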