Re: Partial object caching
No more feedback from you guys for a week now. Which solution do you consider better: the chunk validity bitmap, the start/end span scheme, or maybe another idea?

On Wed, Sep 12, 2012 at 1:30 AM, Bogdan Graur bgr...@gmail.com wrote:

>>> Can we also consider the other way: make the request to the origin server with a larger range so that we may 'join' two disjoint parts of the data? (try to avoid having many empty chunks in between filled chunks)

>> We can consider it but I think it won't work. It would require, among other things, cleverness in changing the outgoing request *and* tweaking the response so that the client gets the originally requested range. Unlikely to be worth it - better to do something in the nature of the other bug you mentioned where, in a plugin, completely synthetic range requests are generated to fill gaps.

> You're right. Changing the request and the response is not the safest thing to do. The idea of generating synthetic range requests from a plugin to fill in gaps sounds very good if you say it can be done.

>>> The solution with the chunk validity bitmap has the advantage that it works fine without throwing away cached data. Does throwing away parts of a cached object while updating the cache for the same object raise any synchronization issues?

>> No, because you don't really throw away the data, you simply change the valid span values in the fragment. Reading and writing those have to be synchronized for other reasons, so there is no additional cost.

>> Which is better may well depend on our anticipated access pattern. It's easy to think of a plausible pattern where a client rolls through an object via range requests (a very common pattern in practice) and each range request spans chunks without ever completely covering a chunk, so you don't actually store anything from the entire set of requests, even though ATS saw the entire object. The span scheme, OTOH, would end up capturing the entire object.

> This roll-through use case might indeed be the common pattern. The chunk validity solution's behavior for this case: in the worst case, as you commented, after reading the whole object we will not be able to cache anything from it. In the average case (when a range request is bigger than the chunk size) the cached object will have many unfilled chunks and we might have to make a lot of small range requests to fill these gaps. The span scheme's behavior is ideal in both the average and the worst case.
>
> Other use cases:
> - Roll through a part of the object, then skip a portion of the file, then roll through again (and so on): the span scheme is again the better one, as this is a generalization of the common pattern.
> - Make many disjoint range requests: the two solutions would probably be equally good, with a plus on the side of the chunk validity bitmap solution as it might cache more data.
>
> Are there any other use cases I have missed?
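[Editorial note] The roll-through contrast discussed above can be sketched in a few lines. Names and fill rules here are our own illustration, not ATS code: the bitmap scheme sets a bit only for chunks *completely* covered by a response, while the span scheme merges any overlapping or adjacent range into its single valid [start, end) span.

```cpp
#include <cstdint>
#include <algorithm>
#include <utility>

// One fragment, 64 chunks of 16K (illustrative sizes).
constexpr uint64_t CHUNK_SIZE = 16 * 1024;

// Bitmap scheme: mark only chunks completely covered by [start, end).
uint64_t bitmap_fill(uint64_t bitmap, uint64_t start, uint64_t end) {
  uint64_t first = (start + CHUNK_SIZE - 1) / CHUNK_SIZE; // round start up
  uint64_t last  = end / CHUNK_SIZE;                      // round end down
  for (uint64_t c = first; c < last; ++c)
    bitmap |= (uint64_t(1) << c);
  return bitmap;
}

// Span scheme: extend the valid span when the new range overlaps or abuts it;
// on a disjoint range, keep the existing span unchanged.
std::pair<uint64_t, uint64_t> span_fill(std::pair<uint64_t, uint64_t> span,
                                        uint64_t start, uint64_t end) {
  if (span.first == span.second)                  // empty span: take new range
    return {start, end};
  if (start <= span.second && end >= span.first)  // overlap/adjacent: merge
    return {std::min(span.first, start), std::max(span.second, end)};
  return span;                                    // disjoint: keep old span
}
```

With rolling requests of [0, 8K), [8K, 24K), [24K, 40K), no single request completely covers a 16K chunk, so the bitmap stays empty even though every byte passed through, while the span grows to cover the whole prefix.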
Re: Partial object caching
>> Can we also consider the other way: make the request to the origin server with a larger range so that we may 'join' two disjoint parts of the data? (try to avoid having many empty chunks in between filled chunks)

> We can consider it but I think it won't work. It would require, among other things, cleverness in changing the outgoing request *and* tweaking the response so that the client gets the originally requested range. Unlikely to be worth it - better to do something in the nature of the other bug you mentioned where, in a plugin, completely synthetic range requests are generated to fill gaps.

You're right. Changing the request and the response is not the safest thing to do. The idea of generating synthetic range requests from a plugin to fill in gaps sounds very good if you say it can be done.

>> The solution with the chunk validity bitmap has the advantage that it works fine without throwing away cached data. Does throwing away parts of a cached object while updating the cache for the same object raise any synchronization issues?

> No, because you don't really throw away the data, you simply change the valid span values in the fragment. Reading and writing those have to be synchronized for other reasons, so there is no additional cost.

> Which is better may well depend on our anticipated access pattern. It's easy to think of a plausible pattern where a client rolls through an object via range requests (a very common pattern in practice) and each range request spans chunks without ever completely covering a chunk, so you don't actually store anything from the entire set of requests, even though ATS saw the entire object. The span scheme, OTOH, would end up capturing the entire object.

This roll-through use case might indeed be the common pattern. The chunk validity solution's behavior for this case: in the worst case, as you commented, after reading the whole object we will not be able to cache anything from it. In the average case (when a range request is bigger than the chunk size) the cached object will have many unfilled chunks and we might have to make a lot of small range requests to fill these gaps. The span scheme's behavior is ideal in both the average and the worst case.

Other use cases:
- Roll through a part of the object, then skip a portion of the file, then roll through again (and so on): the span scheme is again the better one, as this is a generalization of the common pattern.
- Make many disjoint range requests: the two solutions would probably be equally good, with a plus on the side of the chunk validity bitmap solution as it might cache more data.

Are there any other use cases I have missed?
Re: Partial object caching
Your write up seems basically accurate to me. I would note again that reading the earliest Doc of an alternate is an historical artifact of the implementation and not at all a functional requirement.

The other half of the proposal is that, when servicing a range request that has been forwarded to the origin server, we cache fill any complete chunks that are returned and ignore any partial chunks. This is the point of having chunks: to make it much more likely that we can capture data from range requests. We could do that with just fragments but IMHO the granularity of that is too coarse. It might be interesting as a first step to try it, to validate the partial write logic.

There is also the issue of how much compile time and/or run time control should be provided over the granularity of the chunks. IMHO it would be reasonable to support compile time control over a value N such that each fragment consists of 2^N chunks.

The fragment table changes are now on trunk, committed under the TS-1339 bug ID.

P.S. An alternative scheme would be to store for each fragment a valid span (two values, start/end) and then extend that if a range request has an overlap. You could even, in the disjoint case, compute which span had more data and keep that (e.g., effectively discard cached data if you can replace it with a larger span in the fragment).

Sunday, September 9, 2012, 5:37:29 PM, you wrote:

> Thank you very much for the clarifications! I think now I have a better view of the current cache structure and your proposal. Who should also validate these changes before going to the next step, which I think would be the strategy of what and when to cache?
>
> PS: is there a branch already available with your change to move the fragment table from the first Doc to the alternate entries?
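[Editorial note] The compile-time control over N mentioned above might look like the following sketch (constant names are ours, not actual ATS configuration). The key constraint is that with one uint64_t bitmap per fragment, N cannot exceed 6.

```cpp
#include <cstdint>

// A fragment is divided into 2^N chunks; the per-fragment validity
// bitmap fits a uint64_t as long as N <= 6.
constexpr unsigned CHUNKS_PER_FRAGMENT_LOG2 = 6;              // N, compile-time
constexpr uint64_t FRAGMENT_SIZE = uint64_t(1) << 20;         // default 1M fragment
constexpr uint64_t CHUNKS_PER_FRAGMENT =
    uint64_t(1) << CHUNKS_PER_FRAGMENT_LOG2;                  // 64 chunks
constexpr uint64_t CHUNK_BYTES = FRAGMENT_SIZE / CHUNKS_PER_FRAGMENT; // 16K

static_assert(CHUNKS_PER_FRAGMENT_LOG2 <= 6,
              "validity bitmap for one fragment must fit in a uint64_t");
```

With the default 1M fragment and N = 6 this reproduces the 16K chunk size from the original proposal; a build could lower N to trade capture granularity for fewer, larger chunks.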
Re: Partial object caching
> Your write up seems basically accurate to me. I would note again that reading the earliest Doc of an alternate is an historical artifact of the implementation and not at all a functional requirement.

I see. I'm glad you cleared this up too. Initially I thought you planned to store the fragment table in this earliest Doc. I checked the trunk and saw it is in fact stored in the alternate entry itself (the HTTPCacheAlt structure).

> The other half of the proposal is that, when servicing a range request that has been forwarded to the origin server, we cache fill any complete chunks that are returned and ignore any partial chunks.

That's an elegant solution for keeping the chunk validity bitmap consistent! Can we also consider the other way: make the request to the origin server with a larger range so that we may 'join' two disjoint parts of the data? (try to avoid having many empty chunks in between filled chunks)

> This is the point of having chunks: to make it much more likely that we can capture data from range requests. We could do that with just fragments but IMHO the granularity of that is too coarse. It might be interesting as a first step to try it, to validate the partial write logic. There is also the issue of how much compile time and/or run time control should be provided over the granularity of the chunks. IMHO it would be reasonable to support compile time control over a value N such that each fragment consists of 2^N chunks.

I think this can be done.

> The fragment table changes are now on trunk, committed under the TS-1339 bug ID.

Thanks for pointing that out.

> P.S. An alternative scheme would be to store for each fragment a valid span (two values, start/end) and then extend that if a range request has an overlap. You could even, in the disjoint case, compute which span had more data and keep that (e.g., effectively discard cached data if you can replace it with a larger span in the fragment).

The solution with the chunk validity bitmap has the advantage that it works fine without throwing away cached data. Does throwing away parts of a cached object while updating the cache for the same object raise any synchronization issues?

> Sunday, September 9, 2012, 5:37:29 PM, you wrote:
>> Thank you very much for the clarifications! I think now I have a better view of the current cache structure and your proposal. Who should also validate these changes before going to the next step, which I think would be the strategy of what and when to cache?
>> PS: is there a branch already available with your change to move the fragment table from the first Doc to the alternate entries?
Re: Partial object caching
Monday, September 10, 2012, 5:14:40 PM, you wrote:

> Can we also consider the other way: make the request to the origin server with a larger range so that we may 'join' two disjoint parts of the data? (try to avoid having many empty chunks in between filled chunks)

We can consider it but I think it won't work. It would require, among other things, cleverness in changing the outgoing request *and* tweaking the response so that the client gets the originally requested range. Unlikely to be worth it - better to do something in the nature of the other bug you mentioned where, in a plugin, completely synthetic range requests are generated to fill gaps.

> The solution with the chunk validity bitmap has the advantage that it works fine without throwing away cached data. Does throwing away parts of a cached object while updating the cache for the same object raise any synchronization issues?

No, because you don't really throw away the data, you simply change the valid span values in the fragment. Reading and writing those have to be synchronized for other reasons, so there is no additional cost.

Which is better may well depend on our anticipated access pattern. It's easy to think of a plausible pattern where a client rolls through an object via range requests (a very common pattern in practice) and each range request spans chunks without ever completely covering a chunk, so you don't actually store anything from the entire set of requests, even though ATS saw the entire object. The span scheme, OTOH, would end up capturing the entire object.
Re: Partial object caching
Thank you very much for the clarifications! I think now I have a better view of the current cache structure and your proposal.

Current flow for a range request:
1. Find the first Doc for the cached object.
2. The first Doc contains the alternate vector and the fragment table (of course the fragment table belongs to the entries in the alternate, and you said you will make this change soon).
3. Find the best matching alternate entry and use that as a representation of the cached object in the next steps.
4. Read the Doc corresponding to the best matching alternate.
5. Compute the fragments needed to serve the requested range.
6. Read the needed fragments and serve the response.

In the current state, if an object is cached that means its whole data is available from the cache.

With your proposal:
1. Find the first Doc for the cached object.
2. The first Doc contains the alternate vector.
3. Find the best matching alternate entry and use that as a representation of the cached object in the next steps.
4. Read the Doc corresponding to the best matching alternate. [With your change we will have the fragment table available here, I presume.]
5. Compute the needed fragments and verify that all data is available (check the corresponding chunk validity info, which should be located in the fragment table).
6. Read the needed fragments and serve the response if everything is available.

Considering that your proposal fully supports partially cached objects without any performance hit, my opinion is that it would be a good representation for our needs. The changes to the cache structure are also kept to a minimum.

Who should also validate these changes before going to the next step, which I think would be the strategy of what and when to cache?

PS: is there a branch already available with your change to move the fragment table from the first Doc to the alternate entries?

Best regards,
Bogdan Graur

On Thu, Sep 6, 2012 at 4:15 AM, Alan M. Carroll a...@network-geographics.com wrote:

> Wednesday, September 5, 2012, 4:16:42 PM, you wrote:
>> Are all Doc structures of the same size?
> Mostly, but you don't need their actual size, only the size of the content per Doc instance, which is implicit in the fragment offset table. Given a range request we can walk the fragment offset table (which was read with the First/Earliest Doc) until we find the fragment which contains the starting byte of the range and begin loading fragments with that one.
>> So the fragment table entry will be augmented with a new uint64_t field which holds the chunk validity information.
> Right.
>> But after computing all the necessary Doc structures for satisfying a range request they still have to be read from disk (not the content, just the first part containing the fragment table) to check for chunk validity.
> No. The fragment offset table will be moved into the alternate header, which is stored in the First Doc. Once that is read you have all of the fragment offset and chunk validity data in memory. The entire table, with all offsets and validity bitmaps, is a single chunk of data in the alternate header. Again, this is just a proposal to stimulate discussion, it is *not* a committed plan of action.
>> Only when we know we have the complete range valid may we serve the range from the cache.
> Yes, and that can be computed from in-memory data as soon as the alternate is selected.
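[Editorial note] The in-memory validity check described in step 5 above can be sketched as follows. The struct and function names are our own illustration, not ATS code; the point is that once the alternate header is read, the fragment offsets and per-fragment bitmaps suffice to verify a whole range with no further disk I/O.

```cpp
#include <cstdint>
#include <vector>

struct FragInfo {
  uint64_t end_offset;  // frag_offset[i]: first byte past the end of fragment i
  uint64_t valid_bits;  // one bit per chunk; all-ones means fully cached
};

// True iff every chunk overlapping [start, end) is marked valid and the
// range lies within the object.
bool range_is_valid(const std::vector<FragInfo> &frags,
                    uint64_t chunk_size, uint64_t start, uint64_t end) {
  uint64_t frag_start = 0;
  for (const FragInfo &f : frags) {
    if (start < f.end_offset && end > frag_start) {  // range touches fragment
      // Clip the range to this fragment, in fragment-relative bytes.
      uint64_t lo = (start > frag_start ? start : frag_start) - frag_start;
      uint64_t hi = (end < f.end_offset ? end : f.end_offset) - frag_start;
      for (uint64_t c = lo / chunk_size; c * chunk_size < hi; ++c)
        if (!(f.valid_bits & (uint64_t(1) << c)))
          return false;                              // missing chunk: punt
    }
    frag_start = f.end_offset;
  }
  return end <= frag_start;
}
```

Only if this returns true would the cache serve the range; otherwise the request is forwarded (or a synthetic fill request is generated).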
Re: Partial object caching
>> 1. Is there a one to many relation between a cached object and a Doc structure?

> Yes, that's the chained Docs in the lower right. Each Doc represents a fragment and we have discussed previously how objects are stored as an ordered set of fragments.

Just for clarification: I did some tests and it seems the entries in the alternate vector stored in the first Doc are different versions of the same cached object? (I see that these entries are scanned for the best quality match, and the most recent entry with the highest score is used.) So actually AltVec is a vector of linked lists, each linked list being a different version of a cached object?

As the cached object is represented with a linked list of fragments, does that mean that in order to serve a specific range from it we have to read all fragments (presumably from disk) via next_key until we get to the significant fragment? Or maybe again I'm missing something...

>> 2. A Doc structure contains a table of fragments which are represented as uint64_t offsets.

> The fragment offsets are the locations in the object of the fragment data. frag_offset[i] is the address of the first byte past the end of fragment i. To simplify, presume fragments are at most 100 bytes long and we have an object that is 390 bytes long in four fragments (0,1,2,3). Then frag_offset[1] could be 201, which would mean the next byte past the end of fragment 1 is byte 201 of 390 in the original object (or, equivalently, that the first byte of fragment 2 is byte 201 of 390 in the object). This data is used to accelerate range requests so that if you had a request for bytes 220-300 of the object you could skip immediately to fragment 2 without reading fragment 1. In real life there are some very large (20M+) files out there and being able to skip reading the first 10M or so from disk is an enormous performance improvement.
>
> Note the earliest Doc for an alternate is always read because of the way the logic is done now, although in theory that could be skipped because the fragment data you need is in the First Doc, which doesn't contain any object data for objects with multiple alternates or that are more than 1 fragment long.

From this description it seems to me that these fragments inside a fragment resemble the chunks you describe in your original e-mail. The differences are:
- currently these chunks may be of different sizes, as opposed to the proposed new model of having equally sized chunks
- there is currently no bitfield to say which chunks are valid

Are these fragments inside a fragment currently used for anything else than range request acceleration?

Best regards,
Bogdan Graur
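[Editorial note] The range-acceleration lookup quoted above can be sketched as a walk of the frag_offset table. The function name is ours, and the offsets for fragments 0, 2, and 3 are made up for the example (only frag_offset[1] = 201 is given in the message); the semantics follow Alan's description that frag_offset[i] is the first byte past the end of fragment i.

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Return the index of the fragment containing the given byte of the
// object, so earlier fragments never need to be read from disk.
size_t fragment_for_byte(const std::vector<uint64_t> &frag_offset,
                         uint64_t byte) {
  size_t i = 0;
  while (i < frag_offset.size() && frag_offset[i] <= byte)
    ++i;  // byte is past the end of fragment i, keep walking
  return i;
}
```

With hypothetical offsets {100, 201, 290, 390} for the 390-byte object, a request for bytes 220-300 starts the walk at fragment 2, matching the example in the message. (A binary search would also work; the linear walk mirrors the "walk the fragment offset table" phrasing.)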
Re: Partial object caching
Wednesday, September 5, 2012, 3:54:03 AM, you wrote:

> Just for clarification: I did some tests and it seems the entries in the alternate vector stored in the first Doc are different versions of the same cached object?

Yes, see here: http://trafficserver.apache.org/docs/trunk/sdk/http-hooks-and-transactions/http-alternate-selection.en.html

> As the cached object is represented with a linked list of fragments, does that mean that in order to serve a specific range from it we have to read all fragments (presumably from disk) via next_key until we get to the significant fragment?

No. The keys are computationally linked. Given a key for a fragment, the key for the next fragment is computed, not read. Therefore you can compute the key for fragment i from the key for fragment 0 without I/O.

> From this description it seems to me that these fragments inside a fragment resemble the chunks you describe in your original e-mail.

No, the chunks in my original e-mail would be subranges of these fragments. Current fragments are effectively all the same size except the last fragment. Chunks are not currently used for anything, as they do not exist in the current implementation. If implemented as suggested they would be more notional, effectively being stored only as the validity bitmap per fragment. Disk I/O would continue to be per fragment. Fragments exist to limit disk I/O per I/O operation. The range acceleration was bolted on by adding the fragment offset tables.

Hmmm, that would be a problem - how can you verify the entire range if it spans fragments, as there may be missing chunks in the middle? You don't want to start serving and then discover a missing chunk. Ah, that's why the validity bitmaps would be stored in the fragment table, so all of that is available in memory before reading any fragments.
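[Editorial note] The idea of computationally linked keys can be illustrated with a toy hash chain. This is emphatically not how ATS derives its real cache keys; it is only our sketch of the property Alan describes: fragment i's key is a pure function of fragment 0's key, so no I/O is needed to find it.

```cpp
#include <cstdint>
#include <functional>

// Illustrative derivation: the next fragment key is a deterministic
// function of the current one. (Constant and hash choice are arbitrary.)
uint64_t next_key(uint64_t key) {
  return std::hash<uint64_t>{}(key ^ 0x9e3779b97f4a7c15ULL);
}

// Key for fragment i: apply the derivation i times starting from key0.
uint64_t key_for_fragment(uint64_t key0, unsigned i) {
  uint64_t k = key0;
  for (unsigned n = 0; n < i; ++n)
    k = next_key(k);  // pure computation, no disk reads
  return k;
}
```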
Re: Partial object caching
> No. The keys are computationally linked. Given a key for a fragment, the key for the next fragment is computed, not read.

OK, I think I found the way the next_key is computed from a given key.

> Therefore you can compute the key for fragment i from the key for fragment 0 without I/O.

Are all Doc structures of the same size? If they differ in size and we do not have each of their sizes somewhere in RAM (or in the first fragment), how can we get away without any I/O for the non-relevant Docs when serving a range request?

> Hmmm, that would be a problem - how can you verify the entire range if it spans fragments, as there may be missing chunks in the middle? You don't want to start serving and then discover a missing chunk. Ah, that's why the validity bitmaps would be stored in the fragment table, so all of that is available in memory before reading any fragments.

So the fragment table entry will be augmented with a new uint64_t field which holds the chunk validity information. But after computing all the necessary Doc structures for satisfying a range request they still have to be read from disk (not the content, just the first part containing the fragment table) to check for chunk validity. Only when we know we have the complete range valid may we serve the range from the cache.

Thank you very much for pointing me to the Traffic Server Programming Guide!
Re: Partial object caching
Wednesday, September 5, 2012, 4:16:42 PM, you wrote:

> Are all Doc structures of the same size?

Mostly, but you don't need their actual size, only the size of the content per Doc instance, which is implicit in the fragment offset table. Given a range request we can walk the fragment offset table (which was read with the First/Earliest Doc) until we find the fragment which contains the starting byte of the range and begin loading fragments with that one.

> So the fragment table entry will be augmented with a new uint64_t field which holds the chunk validity information.

Right.

> But after computing all the necessary Doc structures for satisfying a range request they still have to be read from disk (not the content, just the first part containing the fragment table) to check for chunk validity.

No. The fragment offset table will be moved into the alternate header, which is stored in the First Doc. Once that is read you have all of the fragment offset and chunk validity data in memory. The entire table, with all offsets and validity bitmaps, is a single chunk of data in the alternate header. Again, this is just a proposal to stimulate discussion, it is *not* a committed plan of action.

> Only when we know we have the complete range valid may we serve the range from the cache.

Yes, and that can be computed from in-memory data as soon as the alternate is selected.
Re: Partial object caching
Thank you for taking the time to draw that diagram! Looking at this diagram and the code, some things are not clear to me:

1. Is there a one to many relation between a cached object and a Doc structure? In other words: do many Doc structures hold the data for a cached object? The diagram says that only the first Doc in the chain will contain the alternate information vector. This suggests there are many Doc structures associated with a cached object.

2. A Doc structure contains a table of fragments which are represented as uint64_t offsets. I understand that you plan to move this table to the alternate information. But it is not clear to me what these offsets currently point into. Where is the data with offset zero held?

Best regards,
Bogdan Graur

On Wed, Aug 29, 2012 at 2:58 PM, Alan M. Carroll a...@network-geographics.com wrote:

> Kinda sorta. I have made a simple diagram with some of this information: http://people.apache.org/~amc/ats-cache-layout.jpg which you might find useful. The proposed change would be quite similar to the existing cache structure but would have a couple of major differences: 1) Support having fallow fragments. The current logic requires all fragments to be present in the cache and cannot detect or handle missing fragments. 2) Subdividing each fragment into chunks. All of that would be new. This would also require moving the fragment offset data into the alternate information, but I already have a patch for that which will be landing on trunk Real Soon Now.
>
> Tuesday, August 28, 2012, 1:14:48 PM, you wrote:
>> This model of representing a cached object indeed allows holding only partial data for the resource. Is this the current way a cached object is represented in TS?
Re: Partial object caching
Thursday, August 30, 2012, 5:29:16 PM, you wrote:

> 1. Is there a one to many relation between a cached object and a Doc structure?

Yes, that's the chained Docs in the lower right. Each Doc represents a fragment and we have discussed previously how objects are stored as an ordered set of fragments.

> 2. A Doc structure contains a table of fragments which are represented as uint64_t offsets.

The fragment offsets are the locations in the object of the fragment data. frag_offset[i] is the address of the first byte past the end of fragment i. To simplify, presume fragments are at most 100 bytes long and we have an object that is 390 bytes long in four fragments (0,1,2,3). Then frag_offset[1] could be 201, which would mean the next byte past the end of fragment 1 is byte 201 of 390 in the original object (or, equivalently, that the first byte of fragment 2 is byte 201 of 390 in the object). This data is used to accelerate range requests so that if you had a request for bytes 220-300 of the object you could skip immediately to fragment 2 without reading fragment 1. In real life there are some very large (20M+) files out there and being able to skip reading the first 10M or so from disk is an enormous performance improvement.

Note the earliest Doc for an alternate is always read because of the way the logic is done now, although in theory that could be skipped because the fragment data you need is in the First Doc, which doesn't contain any object data for objects with multiple alternates or that are more than 1 fragment long.
Re: Partial object caching
Kinda sorta. I have made a simple diagram with some of this information: http://people.apache.org/~amc/ats-cache-layout.jpg which you might find useful.

The proposed change would be quite similar to the existing cache structure but would have a couple of major differences: 1) Support having fallow fragments. The current logic requires all fragments to be present in the cache and cannot detect or handle missing fragments. 2) Subdividing each fragment into chunks. All of that would be new. This would also require moving the fragment offset data into the alternate information, but I already have a patch for that which will be landing on trunk Real Soon Now.

Tuesday, August 28, 2012, 1:14:48 PM, you wrote:

> This model of representing a cached object indeed allows holding only partial data for the resource. Is this the current way a cached object is represented in TS?
Re: Partial object caching
This model of representing a cached object indeed allows holding only partial data for the resource. Is this the current way a cached object is represented in TS?

On Fri, Aug 24, 2012 at 2:19 PM, Alan M. Carroll a...@network-geographics.com wrote:

> Thursday, August 23, 2012, 9:00:29 PM, you wrote:
>> If 1 of N fragments is cached, can we currently serve ranges out of that?
> No.
Re: Partial object caching
On 23/08/2012, at 5:23 PM, Alan M. Carroll a...@network-geographics.com wrote:

> Since this has been a topic for a while, I will just throw out an idea to see how fast you guys can shoot it down.
>
> A cache object is stored as a series of fragments. If we subdivided each fragment into chunks, we could have 64 chunks per fragment and represent them with a bitmap in a single uint64_t. A set bit would indicate valid data in that chunk. Partial content would be written only in chunk units and only for chunks that are complete in the data. For the default size of 1M fragments, each chunk would be 16K, which seems a reasonable value.
>
> The bitmaps would be stored along with the fragment offset table in the alternate info header. This would keep it out of the directory while making it available when serving, because the alternate data is loaded before that point. Range validity checks could also be done without additional disk I/O, because you can't detect if a range is valid for an object before the alternate is determined.
>
> We could only serve if the request range was completely covered, or generate a synthetic range request to cover parts that were not already in the cache. This would mean that files less than one fragment would not have partial content cached, but I think that's acceptable as the advantages of partial caching are only for larger objects.

If 1 of N fragments is cached, can we currently serve ranges out of that?

J
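[Editorial note] The "serve only if completely covered, else synthesize a fill request" rule in the quoted proposal can be sketched for a single fragment as follows (names are our own illustration, not ATS code): collect the byte spans of invalid chunks overlapping the request; an empty result means the range is servable from cache, otherwise the spans are what a synthetic range request would fetch.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

constexpr uint64_t CHUNK = 16 * 1024;  // 1M fragment / 64 chunks

// Return the byte spans of invalid chunks overlapping [start, end),
// coalescing adjacent missing chunks into one span.
std::vector<std::pair<uint64_t, uint64_t>>
missing_spans(uint64_t valid_bits, uint64_t start, uint64_t end) {
  std::vector<std::pair<uint64_t, uint64_t>> gaps;
  for (uint64_t c = start / CHUNK; c * CHUNK < end; ++c) {
    if (valid_bits & (uint64_t(1) << c))
      continue;                          // chunk cached, nothing to fetch
    uint64_t lo = c * CHUNK, hi = (c + 1) * CHUNK;
    if (!gaps.empty() && gaps.back().second == lo)
      gaps.back().second = hi;           // extend the previous gap
    else
      gaps.emplace_back(lo, hi);
  }
  return gaps;
}
```

With chunks 0 and 2 valid and a request covering chunks 0-2, this reports exactly one gap (chunk 1), which is the synthetic range request the proposal would issue before the cached range could be served.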