Hi,

What do you think of using df.columns to inspect the column names and
process them appropriately, or df.schema?
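
For example, something along these lines. It's an untested sketch that
assumes a plain DataFrame (e.g. one you build per batch in your streaming
job) and reuses the same character class that CatalystSchemaConverter
rejects; dropInvalid and sanitize are just illustrative helper names:

import org.apache.spark.sql.DataFrame

// Same character class that checkFieldName in [1] rejects.
val invalid = ".*[ ,;{}()\n\t=].*"

// Option 1: log and drop the offending columns before the Parquet write.
def dropInvalid(df: DataFrame): DataFrame = {
  val (bad, good) = df.columns.partition(_.matches(invalid))
  if (bad.nonEmpty)
    println(s"Dropping columns with invalid Parquet names: ${bad.mkString(", ")}")
  df.select(good.map(df.col): _*)
}

// Option 2: rename them instead, as the error message suggests
// ("Please use alias to rename it").
def sanitize(df: DataFrame): DataFrame =
  df.columns.foldLeft(df) { (d, name) =>
    if (name.matches(invalid))
      d.withColumnRenamed(name, name.replaceAll("[ ,;{}()\n\t=]", "_"))
    else d
  }

That way the job logs the bad keys and keeps running instead of failing at
write time.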

Regards,
Jacek Laskowski
----
https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Tue, Jul 5, 2016 at 7:02 AM, Scott W <defy...@gmail.com> wrote:
> Hello,
>
> I'm processing events using DataFrames converted from a stream of JSON
> events (Spark Streaming), which eventually get written out in Parquet
> format. There are different JSON events coming in, so we use the schema
> inference feature of Spark SQL.
>
> The problem is that some of the JSON events contain spaces in the keys. I
> want to log and filter/drop such events from the DataFrame before
> converting it to Parquet, because ,;{}()\n\t= and space are considered
> special characters in Parquet schemas (see CatalystSchemaConverter), as
> listed in [1] below, and thus are not allowed in column names.
>
> How can I perform such validation on the DataFrame's column names and drop
> such events altogether, without failing the Spark Streaming job?
>
> [1] Spark's CatalystSchemaConverter
>
> def checkFieldName(name: String): Unit = {
>   // ,;{}()\n\t= and space are special characters in Parquet schema
>   checkConversionRequirement(
>     !name.matches(".*[ ,;{}()\n\t=].*"),
>     s"""Attribute name "$name" contains invalid character(s) among " ,;{}()\\n\\t=".
>        |Please use alias to rename it.
>      """.stripMargin.split("\n").mkString(" ").trim)
> }
