Hi,

What do you think of using df.columns (or df.schema) to get the column names and then processing the events accordingly?
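A minimal sketch of what I mean, in plain Scala so it runs without Spark on the classpath. The object and method names (ColumnNameCheck, invalidNames, isWritable) are made up for illustration; only the character set comes from the checkFieldName snippet quoted below.

```scala
object ColumnNameCheck {
  // Same character set that the quoted checkFieldName rejects:
  // space , ; { } ( ) \n \t =
  private val invalidChars = "[ ,;{}()\\n\\t=]".r

  /** Names that contain at least one of the rejected characters. */
  def invalidNames(columns: Seq[String]): Seq[String] =
    columns.filter(name => invalidChars.findFirstIn(name).isDefined)

  /** True when none of the names contains a rejected character. */
  def isWritable(columns: Seq[String]): Boolean =
    invalidNames(columns).isEmpty
}
```

With a DataFrame you could call ColumnNameCheck.invalidNames(df.columns.toSeq), log whatever comes back, and skip the Parquet write (or drop those columns) when the result is non-empty. Note that df.columns only lists top-level names; for nested JSON you would have to walk df.schema instead.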

Regards,
Jacek Laskowski
----
https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

On Tue, Jul 5, 2016 at 7:02 AM, Scott W <defy...@gmail.com> wrote:
> Hello,
>
> I'm processing events using DataFrames converted from a stream of JSON
> events (Spark Streaming), which eventually get written out in Parquet
> format. There are different JSON events coming in, so we use the schema
> inference feature of Spark SQL.
>
> The problem is that some of the JSON events contain spaces in the keys. I
> want to log and filter/drop such events from the DataFrame before
> converting it to Parquet, because " ,;{}()\n\t=" are considered special
> characters in the Parquet schema (CatalystSchemaConverter), as listed in [1]
> below, and thus are not allowed in column names.
>
> How can I do such validation on the DataFrame column names and drop such
> an event altogether without erroring out the Spark Streaming job?
>
> [1] Spark's CatalystSchemaConverter
>
> def checkFieldName(name: String): Unit = {
>   // ,;{}()\n\t= and space are special characters in Parquet schema
>   checkConversionRequirement(
>     !name.matches(".*[ ,;{}()\n\t=].*"),
>     s"""Attribute name "$name" contains invalid character(s) among " ,;{}()\\n\\t=".
>        |Please use alias to rename it.
>      """.stripMargin.split("\n").mkString(" ").trim)
> }