Lei Rui created IOTDB-92:
----------------------------
Summary: The data locality principle used by Spark loses ground in
the face of TsFile.
Key: IOTDB-92
URL: https://issues.apache.org/jira/browse/IOTDB-92
Project: Apache IoTDB
Issue Type: Improvement
Reporter: Lei Rui
During the development of the TsFile-Spark-Connector, we discovered that
the data locality principle used by Spark loses ground in the face of TsFile.
We believe the problem is rooted in the storage structure design of TsFile. Our
latest implementation of the TsFile-Spark-Connector works around the constraint
to guarantee correct query results. The resolution of the data locality problem
itself is left for future work. Below are the details.
h1. 1. Spark Partition
In Apache Spark, data is stored in the form of RDDs and divided into
partitions across various nodes. A partition is a logical chunk of a large
distributed data set that helps parallelize distributed data processing. Spark
relies on the data locality principle to minimize the network traffic caused by
sending data between executors.
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-partitions.html
https://techvidvan.com/tutorials/spark-partition/
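To make the locality mechanism concrete, below is a minimal Scala sketch of how an RDD advertises preferred hosts to the Spark scheduler. The class name LocalityAwareRDD and the blockHosts input are hypothetical, but getPreferredLocations is the actual RDD hook the scheduler consults when placing tasks.
{code:scala}
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Illustrative only: how an RDD exposes data locality to the Spark scheduler.
// The scheduler tries to run each partition's task on one of the hosts
// returned by getPreferredLocations for that partition.
class LocalityAwareRDD(sc: SparkContext, blockHosts: Array[Seq[String]])
  extends RDD[String](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](blockHosts.length)(i =>
      new Partition { override def index: Int = i })

  // Hosts that hold the block backing this partition (hypothetical input,
  // e.g. the HDFS block locations of the underlying file split).
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    blockHosts(split.index)

  override def compute(split: Partition, context: TaskContext): Iterator[String] =
    Iterator(s"partition ${split.index}")
}
{code}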
h1. 2. TsFile Structure
TsFile is a columnar storage file format designed for time series data,
which supports efficient compression and query. Data in a TsFile are organized
in a device-measurement hierarchy. As the figure below shows, the storage unit
of a device is a chunk group and that of a measurement is a chunk; all
measurements of a device are grouped together.
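For illustration, here is a hypothetical, heavily simplified Scala model of this hierarchy; the real TsFile format carries additional metadata (statistics, offsets, etc.) that is omitted here.
{code:scala}
// Simplified, illustrative model of the on-disk hierarchy described above.
case class Page(timestamps: Array[Long], values: Array[Double])
case class Chunk(measurementId: String, pages: Seq[Page])   // one measurement
case class ChunkGroup(deviceId: String, chunks: Seq[Chunk]) // one device
case class TsFileLayout(chunkGroups: Seq[ChunkGroup])       // groups laid out sequentially

// E.g. all measurements of device "d1" sit together in its chunk groups,
// so the bytes of "d1" and "d2" can land in different HDFS blocks.
{code}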
Under this architecture, different Spark partitions logically contain
different sets of chunk groups of a TsFile. Now consider the query process of
this TsFile on Spark. Suppose we run the query “select * from root where
d1.s6<2.5 and d2.s1>10”. Because the predicate spans two devices whose chunk
groups may lie in different partitions, the scheduled task of any single
partition has to access the whole file to compute the right answer. However,
this also means that it is nearly impossible to apply the data locality
principle.
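A toy Scala sketch of why the cross-device predicate defeats per-partition processing; the in-memory maps stand in for the two series and are purely illustrative:
{code:scala}
// Evaluating "select * from root where d1.s6 < 2.5 and d2.s1 > 10":
// a result row at a timestamp combines values from two devices, so both
// devices' chunk groups are needed, wherever they sit in the file.
val d1s6: Map[Long, Double] = Map(1L -> 1.2, 2L -> 3.0, 3L -> 2.0)
val d2s1: Map[Long, Double] = Map(1L -> 12.0, 2L -> 8.0, 3L -> 15.0)

val matchingTimes =
  (d1s6.keySet intersect d2s1.keySet)
    .filter(t => d1s6(t) < 2.5 && d2s1(t) > 10)
// matchingTimes == Set(1L, 3L): the predicate requires aligning the two
// series by time, which a task cannot do from one chunk group alone.
{code}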
h1. 3. Problem
Now we can summarize two problems.
The first problem is how to guarantee the correctness of the query answer
integrated from all the partition tasks without changing the current storage
structure of TsFile. To solve it, we propose converting the space partition
constraint into a time partition constraint, while still requiring each task to
have access to the whole TsFile. As shown in the figure below, the task of
partition 1 is assigned the yellow-marked time partition constraint; the task
of partition 2 is assigned the green-marked time partition constraint; the task
of partition 3 is assigned an empty time partition constraint because the first
two tasks already cover the whole query.
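Below is a minimal sketch of this workaround, using hypothetical names (TimeRange, planTimePartitions) rather than the connector's actual API: each task scans the whole file but keeps only the rows inside its own disjoint time range, so the union over all tasks yields exactly the full answer, and trailing tasks may receive an empty range, like partition 3 above.
{code:scala}
// Split by time instead of by file offset (names are illustrative).
final case class TimeRange(startInclusive: Long, endExclusive: Long)

// Divide the file's global time span [minTime, maxTime] into one disjoint
// range per Spark partition; together the ranges cover every timestamp once.
def planTimePartitions(minTime: Long, maxTime: Long, numPartitions: Int): Seq[TimeRange] = {
  val span = maxTime - minTime + 1
  (0 until numPartitions).map { i =>
    val start = minTime + span * i / numPartitions
    val end   = minTime + span * (i + 1) / numPartitions
    TimeRange(start, end)
  }.filter(r => r.startInclusive < r.endExclusive) // trailing ranges may be empty
}

// Each task still scans the whole TsFile, but only emits rows whose timestamp
// falls inside its own range, so the union of all tasks is exactly the answer.
def taskRows(allRows: Iterator[(Long, Seq[Any])], range: TimeRange): Iterator[(Long, Seq[Any])] =
  allRows.filter { case (t, _) => range.startInclusive <= t && t < range.endExclusive }
{code}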
The second problem is more fundamental: how to adapt TsFile so that some
degree of data locality is preserved when it is queried on Spark.