[
https://issues.apache.org/jira/browse/OAK-6922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tomek Rękawek updated OAK-6922:
-------------------------------
Description:
An HDFS implementation of the segment storage, based on the OAK-6921 work.
h3. HDFS
HDFS is a distributed network file system. The reference implementation is part of
Apache Hadoop, but HDFS-compatible services are also available in the Amazon AWS and
Microsoft Azure clouds. Despite being a file system, it requires a custom API
- unlike NFS or CIFS, it can't simply be mounted locally.
h3. Segment files layout
The new implementation doesn't use tar files. They are replaced with
directories containing one file per segment, named after the segment UUID. This
approach has the following advantages:
* no need to call seek(), which may be expensive on a remote file system.
Instead, we can read the whole file (= one segment) at once.
* it's possible to send multiple segments at once, asynchronously, which
reduces the performance overhead (see below).
The file structure is as follows:
{noformat}
$ hdfs dfs -ls /oak/data00000a.tar
Found 517 items
-rw-r--r-- 1 rekawek supergroup 192 2017-11-08 13:24
/oak/data00000a.tar/0000.b1d032cd-266d-4fd6-acf4-9828f54e2b40
-rw-r--r-- 1 rekawek supergroup 262112 2017-11-08 13:24
/oak/data00000a.tar/0001.445ca696-d5d1-4843-a04b-044f84d93663
-rw-r--r-- 1 rekawek supergroup 262144 2017-11-08 13:24
/oak/data00000a.tar/0002.91ce6f93-d7ed-4d34-a383-3c3d2eea2acb
-rw-r--r-- 1 rekawek supergroup 262144 2017-11-08 13:24
/oak/data00000a.tar/0003.43c09e6f-3d62-4747-ac75-de41b850862a
(...)
-rw-r--r-- 1 rekawek supergroup 191888 2017-11-08 13:32
/oak/data00000a.tar/data00000a.tar.brf
-rw-r--r-- 1 rekawek supergroup 823436 2017-11-08 13:32
/oak/data00000a.tar/data00000a.tar.gph
-rw-r--r-- 1 rekawek supergroup 17408 2017-11-08 13:32
/oak/data00000a.tar/data00000a.tar.idx
{noformat}
Each segment file name is prefixed with an index number. This preserves the
segment order, as in the tar archive. The order is normally stored in the index
file as well, but the recovery process needs it when the index is missing.
Each file contains the raw segment data, with no padding or headers. Apart from
the segment files, there are three special files: binary references (.brf),
segment graph (.gph) and segment index (.idx).
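The index-prefixed naming scheme above can be sketched as a small helper. This is an illustration only; the class and method names (SegmentFileName, format, parseIndex, parseUuid) are hypothetical and not part of Oak, and a decimal four-digit index is assumed based on the listing:

```java
import java.util.UUID;

// Hypothetical helper for the "<4-digit index>.<segment UUID>" file names
// shown in the listing above. Not the actual Oak API.
class SegmentFileName {

    // Builds a file name such as "0003.43c09e6f-3d62-4747-ac75-de41b850862a".
    static String format(int index, UUID uuid) {
        return String.format("%04d.%s", index, uuid);
    }

    // Recovers the ordering index from the prefix; needed by the recovery
    // process when the .idx file is missing.
    static int parseIndex(String name) {
        return Integer.parseInt(name.substring(0, name.indexOf('.')));
    }

    // Recovers the segment UUID from the suffix.
    static UUID parseUuid(String name) {
        return UUID.fromString(name.substring(name.indexOf('.') + 1));
    }
}
```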
h3. Asynchronous writes
Normally, all TarWriter writes are synchronous, appending the segments to the
tar file. With HDFS, each write involves network latency. That's why the
SegmentWriteQueue was introduced. Segments are added to a blocking deque, which
is served by a number of consumer threads writing the segments to HDFS. There's
also a UUID->Segment map, which allows returning segments requested via the
readSegment() method before they are actually persisted. Segments are removed
from the map only after a successful write operation.
The flush() method blocks the acceptance of new segments and returns once all
waiting segments are written. The close() method waits until the current
operations are finished and stops all threads.
The asynchronous mode can be disabled by setting the number of threads to 0.
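A minimal sketch of the queue described above, assuming a pending map that serves reads and a configurable number of consumer threads. The class shape and member names are illustrative, not the actual SegmentWriteQueue; real failure handling (re-queueing, recovery mode) is omitted:

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingDeque;

// Simplified sketch of the SegmentWriteQueue idea: consumer threads drain a
// blocking deque, and a UUID->segment map serves reads for segments that are
// accepted but not yet persisted. Not the actual Oak implementation.
class SegmentWriteQueue {

    interface SegmentConsumer {
        void consume(UUID id, byte[] data) throws Exception;
    }

    private final BlockingDeque<UUID> queue = new LinkedBlockingDeque<>();
    private final Map<UUID, byte[]> pending = new ConcurrentHashMap<>();
    private final Thread[] workers;

    SegmentWriteQueue(int threads, SegmentConsumer writer) {
        workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    while (true) {
                        UUID id = queue.takeFirst();   // blocks until work arrives
                        writer.consume(id, pending.get(id)); // slow remote write
                        pending.remove(id);            // remove only after success
                    }
                } catch (InterruptedException e) {
                    // close() interrupts the workers
                } catch (Exception e) {
                    // real impl: re-add the segment and enter recovery mode
                }
            });
            workers[i].start();
        }
    }

    void addSegment(UUID id, byte[] data) {
        pending.put(id, data);  // readable before it is persisted
        queue.addLast(id);
    }

    // Serves readSegment() for segments that are queued but not yet written.
    byte[] readSegment(UUID id) {
        return pending.get(id);
    }

    // Returns once all accepted segments are written (polling for brevity;
    // the real flush() also blocks the acceptance of new segments).
    void flush() {
        while (!pending.isEmpty()) {
            try { Thread.sleep(10); } catch (InterruptedException e) { return; }
        }
    }

    void close() {
        for (Thread t : workers) {
            t.interrupt();
        }
    }
}
```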
h5. TODO: queue recovery mode
We need to handle write failures in a better way: if the HDFS write()
operation fails, we should re-add the segment to the queue and switch to a
"recovery mode". In this mode, all the threads are suspended and new segments
are not accepted (active waiting). A single thread retries writing the segment
with some delay. Once the segment is successfully written, the queue returns
to normal operation.
This way the unavailable HDFS service is not flooded with requests, and we
don't accept segments that we can't persist.
The close() method ends the recovery mode; in this case, some of the waiting
segments won't be persisted.
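The proposed retry behaviour can be sketched as a single delayed retry loop. The names (RecoveryRetry, retryUntilSuccess, the attempt budget standing in for close()) are assumptions for illustration, not a planned API:

```java
// Sketch of the proposed recovery mode: one thread retries the failed write
// with a delay, so an unavailable HDFS service is not flooded with requests.
class RecoveryRetry {

    interface Write {
        boolean attempt();
    }

    // Retries the failed write with a fixed delay until it succeeds, or until
    // the attempt budget is exhausted (standing in for close() ending the
    // recovery mode, in which case the segment is not persisted).
    static boolean retryUntilSuccess(Write write, long delayMillis, int maxAttempts) {
        for (int i = 0; i < maxAttempts; i++) {
            if (write.attempt()) {
                return true;  // queue can resume normal operation
            }
            try {
                Thread.sleep(delayMillis);  // back off before the next attempt
            } catch (InterruptedException e) {
                return false;
            }
        }
        return false;
    }
}
```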
h5. Consistency
The asynchronous mode isn't as reliable as the standard, synchronous case. The
following cases are possible:
* TarWriter#writeEntry() returns successfully, but the segments are not
persisted.
* TarWriter#writeEntry() accepts a number of segments: S1, S2, S3. S2 and S3
are persisted, but S1 is not.
On the other hand:
* If TarWriter#flush() returns successfully, all the accepted segments have
been persisted.
This may lead to data inconsistency - especially in the second case, where we
lose a middle segment. The impact is still to be discussed.
h3. TODO
* implement the queue recovery mode, as above,
* test the queue,
* move the implementation to its own bundle (requires OSGi support for
SegmentArchiveManager).
> HDFS support for the segment-tar
> --------------------------------
>
> Key: OAK-6922
> URL: https://issues.apache.org/jira/browse/OAK-6922
> Project: Jackrabbit Oak
> Issue Type: New Feature
> Components: segment-tar
> Reporter: Tomek Rękawek
> Fix For: 1.9.0
>
> Attachments: OAK-6922.patch
>
>
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)