[
https://issues.apache.org/jira/browse/OAK-6922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tomek Rękawek updated OAK-6922:
-------------------------------
Attachment: OAK-6922.patch
> HDFS support for the segment-tar
> --------------------------------
>
> Key: OAK-6922
> URL: https://issues.apache.org/jira/browse/OAK-6922
> Project: Jackrabbit Oak
> Issue Type: New Feature
> Components: segment-tar
> Reporter: Tomek Rękawek
> Fix For: 1.9.0
>
> Attachments: OAK-6922.patch
>
>
> An HDFS implementation of the segment storage, based on the OAK-6921 work.
> h3. HDFS
> HDFS is a distributed network file system. The reference implementation
> ships with Apache Hadoop, but HDFS-compatible storage is also available in
> the Amazon AWS and Microsoft Azure clouds. Despite being a file system, it
> requires a custom API to be used: unlike NFS or CIFS, it can't simply be
> mounted locally.
> h3. Segment files layout
> The new implementation doesn't use tar files. They are replaced with
> directories storing segment files named after their UUIDs. This approach has
> the following advantages:
> * no need to call seek(), which may be expensive on a remote file system.
> Instead, the whole file (= one segment) can be read at once.
> * it's possible to send multiple segments at once, asynchronously, which
> reduces the performance overhead (see below).
> The file structure is as follows:
> {noformat}
> $ hdfs dfs -ls /oak/data00000a.tar
> Found 517 items
> -rw-r--r-- 1 rekawek supergroup 192 2017-11-08 13:24
> /oak/data00000a.tar/0000.b1d032cd-266d-4fd6-acf4-9828f54e2b40
> -rw-r--r-- 1 rekawek supergroup 262112 2017-11-08 13:24
> /oak/data00000a.tar/0001.445ca696-d5d1-4843-a04b-044f84d93663
> -rw-r--r-- 1 rekawek supergroup 262144 2017-11-08 13:24
> /oak/data00000a.tar/0002.91ce6f93-d7ed-4d34-a383-3c3d2eea2acb
> -rw-r--r-- 1 rekawek supergroup 262144 2017-11-08 13:24
> /oak/data00000a.tar/0003.43c09e6f-3d62-4747-ac75-de41b850862a
> (...)
> -rw-r--r-- 1 rekawek supergroup 191888 2017-11-08 13:32
> /oak/data00000a.tar/data00000a.tar.brf
> -rw-r--r-- 1 rekawek supergroup 823436 2017-11-08 13:32
> /oak/data00000a.tar/data00000a.tar.gph
> -rw-r--r-- 1 rekawek supergroup 17408 2017-11-08 13:32
> /oak/data00000a.tar/data00000a.tar.idx
> {noformat}
> For the segment files, each name is prefixed with an index number. This
> preserves the ordering of the original tar archive. The order is normally
> stored in the index file as well, but the recovery process needs the prefix
> when the index is missing.
> Each file contains the raw segment data, with no padding or headers. Apart
> from the segment files, there are 3 special files: binary references (.brf),
> segment graph (.gph) and segment index (.idx).
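> The naming scheme above could be sketched as follows. This is a
> hypothetical illustration, not the patch's actual code; the "%04x" (hex)
> width-4 prefix is an assumption based on the listing above.

```java
import java.util.UUID;

// Hypothetical sketch of the index-prefixed segment file names described
// above: the entry index is prepended to the UUID so the original tar
// order can be recovered even when the index file is missing.
class SegmentFileName {

    // Builds a name like "002a.445ca696-d5d1-4843-a04b-044f84d93663".
    static String format(int index, UUID uuid) {
        return String.format("%04x.%s", index, uuid);
    }

    // Recovers the entry index from a file name, e.g. during recovery.
    static int parseIndex(String name) {
        return Integer.parseInt(name.substring(0, name.indexOf('.')), 16);
    }
}
```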
> h3. Asynchronous writes
> Normally, all TarWriter writes are synchronous, appending the segments to
> the tar file. With HDFS, each write involves network latency. That's why
> the SegmentWriteQueue was introduced. Segments are added to a blocking
> deque, which is served by a number of consumer threads writing the
> segments to HDFS. There's also a UUID->Segment map, which makes it
> possible to return segments requested via the readSegment() method before
> they are actually persisted. Segments are removed from the map only after
> a successful write operation.
> The flush() method stops accepting new segments and returns once all
> waiting segments are written. The close() method waits until the current
> operations are finished and stops all threads.
> The asynchronous mode can be disabled by setting the number of threads to 0.
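> The queue described above could look roughly like this. This is a
> simplified sketch under assumed names and signatures, not the actual
> SegmentWriteQueue from the patch; error handling and the flush() blocking
> semantics are reduced to a busy wait for brevity.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.function.BiConsumer;

// Hypothetical sketch of the asynchronous write queue described above:
// a blocking deque served by consumer threads, plus a UUID->data map
// that serves reads for segments not yet persisted.
class SegmentWriteQueue {

    private final BlockingDeque<UUID> queue = new LinkedBlockingDeque<>();
    private final Map<UUID, byte[]> pending = new ConcurrentHashMap<>();
    private final BiConsumer<UUID, byte[]> backend; // stands in for the HDFS write
    private final Thread[] consumers;
    private volatile boolean closed;

    SegmentWriteQueue(int threads, BiConsumer<UUID, byte[]> backend) {
        this.backend = backend;
        consumers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            consumers[i] = new Thread(this::consume);
            consumers[i].start();
        }
    }

    void addSegment(UUID id, byte[] data) {
        pending.put(id, data);  // visible to readSegment() immediately
        queue.addLast(id);
    }

    // Serves segments that are accepted but not yet persisted.
    byte[] readSegment(UUID id) {
        return pending.get(id);
    }

    private void consume() {
        while (!closed) {
            try {
                UUID id = queue.takeFirst();
                backend.accept(id, pending.get(id));
                pending.remove(id); // only after a successful write
            } catch (InterruptedException e) {
                return;
            }
        }
    }

    // Simplified: waits until every accepted segment has been persisted.
    void flush() throws InterruptedException {
        while (!pending.isEmpty()) {
            Thread.sleep(10);
        }
    }

    void close() {
        closed = true;
        for (Thread t : consumers) {
            t.interrupt();
        }
    }
}
```

> Setting the thread count to 0 would leave the queue unserved, which
> corresponds to disabling the asynchronous mode in the description above.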
> h5. TODO: queue recovery mode
> We need to handle write failures in a better way: if the HDFS write()
> operation fails, we should re-add the segment to the queue and switch to a
> "recovery mode". In this mode, all the threads are suspended and new
> segments are not accepted (active waiting). A single thread retries adding
> the segment with some delay. Once the segment is successfully written, the
> queue goes back to normal operation.
> This way an unavailable HDFS service is not flooded with requests, and we
> don't accept segments that we can't persist.
> The close() method ends the recovery mode; in this case, some of the
> waiting segments won't be persisted.
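> One possible shape of the proposed retry loop is sketched below. Since
> this is still a TODO, every name here is an assumption; the point is only
> the single-threaded retry-with-delay that ends either on success or when
> close() terminates the recovery mode.

```java
import java.io.IOException;
import java.util.function.BooleanSupplier;

// Hypothetical functional interface standing in for one pending HDFS write.
interface SegmentWrite {
    void write() throws IOException;
}

// Hypothetical sketch of the proposed recovery mode: a single thread
// re-attempts the failed write with a delay, so an unavailable HDFS
// service is not flooded with requests.
class RecoveryRetry {

    // Returns true if the segment was eventually persisted, false if the
    // queue was closed (or the thread interrupted) before that happened.
    static boolean retryWithDelay(SegmentWrite write, long delayMillis,
                                  BooleanSupplier closed) {
        while (!closed.getAsBoolean()) {
            try {
                write.write();
                return true; // success: back to normal operation
            } catch (IOException e) {
                try {
                    Thread.sleep(delayMillis); // back off before retrying
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false; // close() ended recovery; segment not persisted
    }
}
```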
> h5. Consistency
> The asynchronous mode isn't as reliable as the standard, synchronous case.
> The following cases are possible:
> * TarWriter#writeEntry() returns successfully, but the segments are not
> persisted.
> * TarWriter#writeEntry() accepts a number of segments: S1, S2, S3. S2 and
> S3 are persisted, but S1 is not.
> On the other hand:
> * If TarWriter#flush() returns successfully, all the accepted segments
> have been persisted.
> This may lead to data inconsistency, especially in the second case, where
> we lose a middle segment. The impact is still to be discussed.
> h3. TODO
> * implement the queue recovery mode, as above,
> * test the queue,
> * move the implementation to its own bundle (requires OSGi support for
> SegmentArchiveManager).
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)