[
https://issues.apache.org/jira/browse/HDDS-4092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ethan Rose resolved HDDS-4092.
------------------------------
Resolution: Information Provided
> Writing delta to Ozone hangs when creating the _delta_log json
> --------------------------------------------------------------
>
> Key: HDDS-4092
> URL: https://issues.apache.org/jira/browse/HDDS-4092
> Project: Apache Ozone
> Issue Type: Bug
> Components: Ozone Filesystem
> Affects Versions: 0.5.0
> Environment: We are using Kubernetes k8s, Ozone 0.5.0beta, Spark
> 3.0.0, Hadoop 3.2, Scala 2.12.10, and io.delta:delta-core_2.12:0.7.0
> Reporter: Dustin Smith
> Assignee: Marton Elek
> Priority: Major
> Labels: delta, filesystem, scala, spark
>
> I am testing writing Delta (OSS, not Databricks) data to Ozone FS, since my
> company is looking to replace Hadoop if feasible. However, whenever I write a
> Delta table, the parquet files are written and the _delta_log directory is
> created, but the json file is never written.
> I am using the Spark operator to submit a batch test job that writes about
> 5 MB of data.
> There is no error on either the driver or the executors; the driver never
> finishes because the creation of the json hangs.
>
> Below is the code I used for testing with the Spark operator; I also ran the
> pieces in the shell for testing. In the save path, update the bucket and
> volume info for your data store.
> {code:scala}
> package app.OzoneTest
>
> import org.apache.spark.sql.{DataFrame, SparkSession}
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.types.{BinaryType, StringType}
>
> object CreateData {
>
>   def main(args: Array[String]): Unit = {
>     val spark: SparkSession = SparkSession
>       .builder()
>       .appName("Create Ozone Mock Data")
>       .enableHiveSupport()
>       .getOrCreate()
>     import spark.implicits._
>
>     val df: DataFrame = Seq.fill(100000) {
>         (randomID, randomLat, randomLong, randomDates, randomHour)
>       }
>       .toDF("msisdn", "latitude", "longitude", "par_day", "par_hour")
>       .withColumn("msisdn", $"msisdn".cast(StringType))
>       .withColumn("msisdn", sha1($"msisdn".cast(BinaryType)))
>       .select("msisdn", "latitude", "longitude", "par_day", "par_hour")
>
>     df
>       .repartition(3, $"msisdn")
>       .sortWithinPartitions("latitude", "longitude")
>       .write
>       .partitionBy("par_day", "par_hour")
>       .format("delta")
>       .save("o3fs://your_bucket.your_volume/location_data")
>   }
>
>   def randomID: Int = scala.util.Random.nextInt(10) + 1
>   def randomDates: Int = 20200101 + scala.util.Random.nextInt((20200131 - 20200101) + 1)
>   def randomHour: Int = scala.util.Random.nextInt(24)
>   def randomLat: Double = 13.5 + scala.util.Random.nextFloat()
>   def randomLong: Double = 100 + scala.util.Random.nextFloat()
> }
> {code}
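> For reference, the step that hangs is Delta's transaction-log commit, which
> writes the first log entry (_delta_log/00000000000000000000.json) after the
> parquet data files. A minimal sketch that isolates just that commit, leaving
> the random data generation out of the picture (assumes a Spark session with
> io.delta:delta-core_2.12:0.7.0 on the classpath and the o3fs filesystem
> configured; the save path is a placeholder):
> {code:scala}
> // Minimal isolation of the hanging step: a tiny Delta write.
> // Assumes a running Spark session with delta-core on the classpath
> // and Ozone's o3fs configured; the path below is a placeholder.
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder()
>   .appName("Delta o3fs smoke test")
>   .getOrCreate()
> import spark.implicits._
>
> // A single-row table keeps the parquet step trivial, so a hang here
> // points at the _delta_log json commit rather than at data volume.
> Seq((1, "a")).toDF("id", "value")
>   .write
>   .format("delta")
>   .save("o3fs://your_bucket.your_volume/delta_smoke_test")
>
> // On success, _delta_log/00000000000000000000.json should appear
> // alongside the parquet files under the save path.
> {code}
> If this one-row write also hangs, the problem is independent of the data set
> and partitioning in the job above.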
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]