Davide Giannella updated OAK-6922:
----------------------------------
Fix Version/s: 1.10

HDFS support for the segment-tar
--------------------------------

Key: OAK-6922
URL: https://issues.apache.org/jira/browse/OAK-6922
Project: Jackrabbit Oak
Issue Type: New Feature
Components: segment-tar
Reporter: Tomek Rękawek
Fix For: 1.9.0, 1.10
Attachments: OAK-6922.patch

An HDFS implementation of the segment storage, based on the OAK-6921 work.

h3. HDFS

HDFS is a distributed network file system. The most popular implementation is Apache Hadoop, but HDFS is also available in the Amazon AWS and Microsoft Azure clouds. Despite being a file system, it requires a custom client API: unlike NFS or CIFS, it can't simply be mounted locally.

h3. Segment files layout

The new implementation doesn't use tar files. They are replaced with directories containing one file per segment, named after the segment UUID. This approach has the following advantages:

* There's no need to call seek(), which may be expensive on a remote file system. Instead, the whole file (= one segment) can be read at once.
* It's possible to send multiple segments at once, asynchronously, which reduces the performance overhead (see below).

The file structure is as follows:

{noformat}
$ hdfs dfs -ls /oak/data00000a.tar
Found 517 items
-rw-r--r--   1 rekawek supergroup     192 2017-11-08 13:24 /oak/data00000a.tar/0000.b1d032cd-266d-4fd6-acf4-9828f54e2b40
-rw-r--r--   1 rekawek supergroup  262112 2017-11-08 13:24 /oak/data00000a.tar/0001.445ca696-d5d1-4843-a04b-044f84d93663
-rw-r--r--   1 rekawek supergroup  262144 2017-11-08 13:24 /oak/data00000a.tar/0002.91ce6f93-d7ed-4d34-a383-3c3d2eea2acb
-rw-r--r--   1 rekawek supergroup  262144 2017-11-08 13:24 /oak/data00000a.tar/0003.43c09e6f-3d62-4747-ac75-de41b850862a
(...)
-rw-r--r--   1 rekawek supergroup  191888 2017-11-08 13:32 /oak/data00000a.tar/data00000a.tar.brf
-rw-r--r--   1 rekawek supergroup  823436 2017-11-08 13:32 /oak/data00000a.tar/data00000a.tar.gph
-rw-r--r--   1 rekawek supergroup   17408 2017-11-08 13:32 /oak/data00000a.tar/data00000a.tar.idx
{noformat}

Each segment file name is prefixed with an index number, which preserves the segment order of the tar archive. This order is normally stored in the index file as well, but the recovery process relies on the name prefix when the index is missing. Each file contains the raw segment data, with no padding or headers. Apart from the segment files, there are 3 special files: binary references (.brf), segment graph (.gph) and segment index (.idx). A sketch of reading a segment laid out this way is shown after the next section.

h3. Asynchronous writes

Normally, all TarWriter writes are synchronous, appending the segments to the tar file. With HDFS, each write involves network latency; that's why the SegmentWriteQueue was introduced. Segments are added to a blocking deque, which is served by a number of consumer threads writing the segments to HDFS. There's also a UUID->Segment map, which allows returning segments requested via readSegment() before they are actually persisted. Segments are removed from the map only after a successful write operation.

The flush() method blocks accepting new segments and returns after all waiting segments are written. The close() method waits until the current operations are finished and stops all threads.

The asynchronous mode can be disabled by setting the number of threads to 0.
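To make the layout concrete, here is a minimal sketch of reading segments with the plain Hadoop FileSystem API, in a single pass and without seek(). The namenode URI, the archive directory and the parsing of the 4-digit index prefix are illustrative assumptions, not the patch's actual code:

{noformat}
import java.net.URI;
import java.util.UUID;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SegmentReadSketch {
    public static void main(String[] args) throws Exception {
        // illustrative namenode URI and archive directory
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), new Configuration());
        for (FileStatus status : fs.listStatus(new Path("/oak/data00000a.tar"))) {
            String name = status.getPath().getName();
            if (name.endsWith(".brf") || name.endsWith(".gph") || name.endsWith(".idx")) {
                continue; // special files, not segments
            }
            // segment file name = <4-digit index>.<segment UUID>
            int index = Integer.parseInt(name.substring(0, 4));
            UUID uuid = UUID.fromString(name.substring(5));
            // read the whole file (= one segment) in one go, no seek()
            byte[] data = new byte[(int) status.getLen()];
            try (FSDataInputStream in = fs.open(status.getPath())) {
                in.readFully(data);
            }
            System.out.printf("segment %04d %s: %d bytes%n", index, uuid, data.length);
        }
    }
}
{noformat}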
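The write queue described above can be modelled as below. This is a simplified sketch, not the patch's SegmentWriteQueue: the SegmentConsumer callback standing in for the actual HDFS write is a hypothetical name, and close() and the recovery mode are left out:

{noformat}
import java.util.UUID;
import java.util.concurrent.*;

class SegmentWriteQueueSketch {

    /** Hypothetical callback performing the actual (remote) write. */
    interface SegmentConsumer {
        void consume(UUID id, byte[] data) throws Exception;
    }

    private final BlockingDeque<UUID> queue = new LinkedBlockingDeque<>();
    private final ConcurrentMap<UUID, byte[]> pending = new ConcurrentHashMap<>();
    private final SegmentConsumer consumer;

    SegmentWriteQueueSketch(int threads, SegmentConsumer consumer) {
        this.consumer = consumer;
        ExecutorService workers = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            workers.execute(this::consumeLoop);
        }
    }

    /** Returns as soon as the segment is queued, not when it's persisted. */
    void addSegment(UUID id, byte[] data) {
        pending.put(id, data); // readable before it's persisted
        queue.addLast(id);
    }

    /** Serves reads for segments that are accepted but not yet persisted. */
    byte[] readSegment(UUID id) {
        return pending.get(id);
    }

    /** Blocks until every accepted segment has been written. */
    void flush() throws InterruptedException {
        while (!pending.isEmpty()) {
            Thread.sleep(10); // simplistic polling; the real queue also blocks new writers
        }
    }

    private void consumeLoop() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                UUID id = queue.takeFirst();
                try {
                    consumer.consume(id, pending.get(id));
                    pending.remove(id); // removed only after a successful write
                } catch (Exception e) {
                    queue.addFirst(id); // re-add; the real queue switches to the recovery mode here
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
{noformat}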
h5. Queue recovery mode

If the HDFS write() operation fails, the segment is re-added and the queue switches to a "recovery mode". In this mode, all the consumer threads are suspended and new segments are not accepted (active waiting). A single thread retries writing the segment with some delay. Once the segment is successfully written, the queue goes back to normal operation. This way an unavailable HDFS service isn't flooded with requests, and no segments are accepted while they can't be persisted.

The close() method finishes the recovery mode; in this case, some of the awaiting segments won't be persisted.

h5. Consistency

The asynchronous mode isn't as reliable as the standard, synchronous case. The following cases are possible:

* TarWriter#writeEntry() returns successfully, but the segments are not persisted.
* TarWriter#writeEntry() accepts a number of segments S1, S2, S3; S2 and S3 are persisted, but S1 is not.

On the other hand:

* If TarWriter#flush() returns successfully, all the accepted segments have been persisted.

h5. Recovery

During segment recovery (e.g. if the index file is missing), the HDFS implementation checks whether a segment is missing in the middle. If so, only the consecutive segments are recovered: for instance, given S1, S2, S3, S5, S6, S7, the recovery process returns only the first three (see the sketch below).

h3. TODO

* Move the implementation to its own bundle (requires OSGi support for SegmentArchiveManager).
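The consecutive-segment rule from the Recovery section can be sketched as follows. This is an illustrative helper, not the patch's actual code, assuming the recovered entries have already been keyed by their parsed index prefix:

{noformat}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;

class RecoverySketch {
    /** Keeps only the leading gap-free run: S1, S2, S3, S5, S6, S7 -> S1, S2, S3. */
    static <T> List<T> consecutivePrefix(SortedMap<Integer, T> segments) {
        List<T> result = new ArrayList<>();
        if (segments.isEmpty()) {
            return result;
        }
        int expected = segments.firstKey();
        for (Map.Entry<Integer, T> e : segments.entrySet()) {
            if (e.getKey() != expected) {
                break; // gap found: everything after it is ignored
            }
            result.add(e.getValue());
            expected++;
        }
        return result;
    }
}
{noformat}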