[ 
https://issues.apache.org/jira/browse/HDDS-4092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Rose resolved HDDS-4092.
------------------------------
    Resolution: Information Provided

> Writing delta to Ozone hangs when creating the _delta_log json
> --------------------------------------------------------------
>
>                 Key: HDDS-4092
>                 URL: https://issues.apache.org/jira/browse/HDDS-4092
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Ozone Filesystem
>    Affects Versions: 0.5.0
>         Environment: We are using Kubernetes (k8s), Ozone 0.5.0-beta, Spark 
> 3.0.0, Hadoop 3.2, Scala 2.12.10, and io.delta:delta-core_2.12:0.7.0
>            Reporter: Dustin Smith
>            Assignee: Marton Elek
>            Priority: Major
>              Labels: delta, filesystem, scala, spark
>
> I am testing writing Delta (OSS, not Databricks) data to the Ozone filesystem, 
> since my company is looking to replace Hadoop if feasible. However, whenever I 
> write a Delta table, the parquet files are written and the _delta_log directory 
> is created, but the JSON commit file is never written.
> I am using the Spark operator to submit a batch test job that writes about 
> 5 MB of data.
> There is no error on either the driver or the executors. The driver never 
> finishes because the creation of the JSON commit file hangs.
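> 
> To see what actually lands in the table while the job hangs, a directory 
> listing through the Hadoop FileSystem API is enough. This is only a sketch; 
> the path below is a placeholder for the real volume and bucket:
> {code:java}
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
> 
> // Placeholder path; substitute the bucket/volume the job writes to.
> val tablePath = new Path("o3fs://your_bucket.your_volume/location_data")
> val fs = FileSystem.get(tablePath.toUri, new Configuration())
> 
> // The partition directories and parquet files show up and _delta_log exists,
> // but no 00000000000000000000.json commit file ever appears inside it.
> fs.listStatus(new Path(tablePath, "_delta_log"))
>   .foreach(status => println(status.getPath))
> {code}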
>  
> This is the code I used for testing with the Spark operator; I also ran the 
> same pieces in the shell for testing. In the save path, update the bucket and 
> volume info for your data store.
> {code:java}
> package app.OzoneTest
> import org.apache.spark.sql.{DataFrame, SparkSession}
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.types.{BinaryType, StringType}
> object CreateData {
>   def main(args: Array[String]): Unit = {
>     val spark: SparkSession = SparkSession
>       .builder()
>       .appName(s"Create Ozone Mock Data")
>       .enableHiveSupport()
>       .getOrCreate()
>     import spark.implicits._
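>     // Build ~100k rows of synthetic location data (id, lat, long, day, hour).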
>     val df: DataFrame = Seq.fill(100000) {
>       (randomID, randomLat, randomLong, randomDates, randomHour)
>     }
>       .toDF("msisdn", "latitude", "longitude", "par_day", "par_hour")
>       .withColumn("msisdn", $"msisdn".cast(StringType))
>       .withColumn("msisdn", sha1($"msisdn".cast(BinaryType)))
>       .select("msisdn", "latitude", "longitude", "par_day", "par_hour")
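>     // Repartition by id, sort within partitions, and write as a Delta table
>     // partitioned by day and hour.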
>     df
>       .repartition(3, $"msisdn")
>       .sortWithinPartitions("latitude", "longitude")
>       .write
>       .partitionBy("par_day", "par_hour")
>       .format("delta")
>       .save("o3fs://your_bucket.your_volume/location_data")
>   }
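>   // Helpers producing random values for each column.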
>   def randomID: Int = scala.util.Random.nextInt(10) + 1
>   def randomDates: Int =
>     20200101 + scala.util.Random.nextInt((20200131 - 20200101) + 1)
>   def randomHour: Int = scala.util.Random.nextInt(24)
>   def randomLat: Double = 13.5 + scala.util.Random.nextFloat()
>   def randomLong: Double = 100 + scala.util.Random.nextFloat()
> }
> {code}
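> 
> For completeness, the reproduction assumes the Ozone filesystem is wired into 
> the Spark session roughly as below. This is only a sketch of my environment: 
> the property names follow the Ozone docs, and the hadoop-ozone-filesystem 
> client jar is assumed to be on the Spark classpath.
> {code:java}
> import org.apache.spark.sql.SparkSession
> 
> val spark = SparkSession
>   .builder()
>   .appName("Create Ozone Mock Data")
>   // Map the o3fs:// scheme to the Ozone Hadoop-compatible FileSystem.
>   .config("spark.hadoop.fs.o3fs.impl",
>           "org.apache.hadoop.fs.ozone.OzoneFileSystem")
>   // AbstractFileSystem binding used by FileContext-based code paths
>   // (e.g. Delta's default HDFSLogStore, which writes the _delta_log JSON).
>   .config("spark.hadoop.fs.AbstractFileSystem.o3fs.impl",
>           "org.apache.hadoop.fs.ozone.OzFs")
>   .getOrCreate()
> {code}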



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
