Re: Partial object caching

2012-09-19 Thread Bogdan Graur
No feedback from you guys for a week now.

Which solution do you consider better? The chunk validity bitmap, the
start/end span scheme or maybe another idea?


On Wed, Sep 12, 2012 at 1:30 AM, Bogdan Graur bgr...@gmail.com wrote:


  Can we also consider the other way: make the request to the origin server
  with a larger range so that we may 'join' two disjoint parts of the
 data?
  (try to avoid having many empty chunks in between filled chunks)

 We can consider it but I think it won't work. It would require, among
 other things, cleverness in changing the outgoing request *and* tweaking
 the response so that the client gets the originally requested range.
 Unlikely to be worth it - better to do something in the nature of the other
 bug you mentioned where, in a plugin, completely synthetic range requests
 are generated to fill gaps.

 You're right. Changing the request and the response is not the safest
 thing to do.
 The idea to generate synthetic range requests from a plugin to fill in
 gaps sounds very good if you say it can be done.


  The solution with the chunk validity bitmap has the advantage that it
 works
  fine without throwing away cached data.
  Does throwing away parts of a cached object while updating the cache for
  the same object raise any synchronization issues?

 No, because you don't really throw away the data, you simply change the
 valid span values in the fragment. Reading and writing those have to be
 synchronized for other reasons so it's no additional cost.

 Which is better may well depend on our anticipated access pattern. It's
 easy to think of a plausible pattern where a client rolls through an
 object via range requests (a very common pattern in practice) and each
 range request spans chunks without ever covering a chunk, so you don't
 actually store anything from the entire set of requests, even though ATS
 saw the entire object. The span scheme, OTOH, would end up capturing the
 entire object.


 This roll-through use case might indeed be the common pattern.

 The chunk validity solution behavior for this case:
 In the worst case, as you commented, after reading the whole object we
 will not be able to cache anything from the object.
 In the average case (when a range request is larger than the chunk size)
 the cached object will have many unfilled chunks and we might have
 to do a lot of small range requests to fill these gaps.

 The span scheme behavior is ideal in both the average and worst case
 scenarios.

 Other use cases:
 - roll through a part of the object, then skip a portion of the file, and
 then roll through again (and so on): the span scheme is again the better
 one as this is a generalization of the common pattern.
 - make many disjoint range requests: the two solutions would probably be
 equally good with a plus on the side of the chunk validity bitmap solution
 as it might cache more data.

 Any other use cases I have missed?


Re: Partial object caching

2012-09-11 Thread Bogdan Graur
  Can we also consider the other way: make the request to the origin server
  with a larger range so that we may 'join' two disjoint parts of the data?
  (try to avoid having many empty chunks in between filled chunks)

 We can consider it but I think it won't work. It would require, among
 other things, cleverness in changing the outgoing request *and* tweaking
 the response so that the client gets the originally requested range.
 Unlikely to be worth it - better to do something in the nature of the other
 bug you mentioned where, in a plugin, completely synthetic range requests
 are generated to fill gaps.

You're right. Changing the request and the response is not the safest thing
to do.
The idea to generate synthetic range requests from a plugin to fill in gaps
sounds very good if you say it can be done.


  The solution with the chunk validity bitmap has the advantage that it
 works
  fine without throwing away cached data.
  Does throwing away parts of a cached object while updating the cache for
  the same object raise any synchronization issues?

 No, because you don't really throw away the data, you simply change the
 valid span values in the fragment. Reading and writing those have to be
 synchronized for other reasons so it's no additional cost.

 Which is better may well depend on our anticipated access pattern. It's
 easy to think of a plausible pattern where a client rolls through an
 object via range requests (a very common pattern in practice) and each
 range request spans chunks without ever covering a chunk, so you don't
 actually store anything from the entire set of requests, even though ATS
 saw the entire object. The span scheme, OTOH, would end up capturing the
 entire object.


This roll-through use case might indeed be the common pattern.

The chunk validity solution behavior for this case:
In the worst case, as you commented, after reading the whole object we will
not be able to cache anything from the object.
In the average case (when a range request is larger than the chunk size)
the cached object will have many unfilled chunks and we might have
to do a lot of small range requests to fill these gaps.
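
To make that gap filling concrete, here is a minimal sketch (hypothetical
names, not actual ATS code) of deriving the needed small range requests
from a fragment's validity bitmap:

#include <cstdint>
#include <utility>
#include <vector>

// 1M fragment / 64 chunks = 16K per chunk (illustrative values).
constexpr int64_t CHUNK_SIZE = 16 * 1024;

// Given one fragment's validity bitmap (bit i set => chunk i cached),
// return the fragment-relative byte ranges [start, end) still missing.
// (For the last fragment the tail would need clamping to the object size.)
std::vector<std::pair<int64_t, int64_t>> missing_ranges(uint64_t valid_bits)
{
  std::vector<std::pair<int64_t, int64_t>> gaps;
  for (int i = 0; i < 64; ++i) {
    if (valid_bits & (uint64_t(1) << i)) continue; // chunk already cached
    int j = i;
    while (j < 64 && !(valid_bits & (uint64_t(1) << j))) ++j; // end of gap
    gaps.emplace_back(int64_t(i) * CHUNK_SIZE, int64_t(j) * CHUNK_SIZE);
    i = j;
  }
  return gaps;
}

Each returned range would become one synthetic range request to the origin.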

The span scheme behavior is ideal in both the average and worst case
scenarios.

Other use cases:
- roll through a part of the object, then skip a portion of the file, and
then roll through again (and so on): the span scheme is again the better
one as this is a generalization of the common pattern.
- make many disjoint range requests: the two solutions would probably be
equally good with a plus on the side of the chunk validity bitmap solution
as it might cache more data.

Any other use cases I have missed?


Re: Partial object caching

2012-09-10 Thread Alan M. Carroll
Your write-up seems basically accurate to me. I would note again that reading
the earliest Doc of an alternate is an historical artifact of the 
implementation and not at all a functional requirement.

The other half of the proposal is that, when servicing a range request that has 
been forwarded to the origin server, we cache fill any complete chunks that are 
returned and ignore any partial chunks. This is the point of having chunks, to 
make it much more likely that we can capture data from range requests. We could 
do that with just fragments but IMHO the granularity of that is too coarse. It 
might be interesting as a first step to try it to validate the partial write 
logic.
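
As a sketch of that fill rule (illustrative only, hypothetical names): given
the byte range [start, end) of origin data falling inside one fragment, only
the completely covered chunks get their bits set, so partial chunks at the
edges are simply ignored:

#include <cstdint>

constexpr int64_t CHUNK_SIZE = 16 * 1024; // 1M fragment / 64 chunks

// start/end are fragment-relative; bits for fully covered chunks are set.
uint64_t mark_complete_chunks(uint64_t valid_bits, int64_t start, int64_t end)
{
  int64_t first = (start + CHUNK_SIZE - 1) / CHUNK_SIZE; // round start up
  int64_t last  = end / CHUNK_SIZE;                      // round end down
  for (int64_t c = first; c < last; ++c)
    valid_bits |= uint64_t(1) << c;
  return valid_bits;
}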

There is also the issue of how much compile time and/or run time control should 
be provided over the granularity of the chunks. IMHO it would be reasonable to
support compile time control over a value N such that each fragment consists of 
2^N chunks.
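
Something like this would do for the compile time case (just a sketch; the
macro name is invented):

#ifndef CACHE_CHUNK_SHIFT
#define CACHE_CHUNK_SHIFT 6 // N: each fragment is 2^N chunks
#endif

constexpr int CHUNKS_PER_FRAGMENT = 1 << CACHE_CHUNK_SHIFT;
static_assert(CHUNKS_PER_FRAGMENT <= 64,
              "chunk validity bitmap must fit in a single uint64_t");

N = 6 gives 64 chunks of 16K for the default 1M fragment.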

The fragment table changes are now on trunk, committed under the TS-1339 bug
ID.

P.S. An alternative scheme would be to store, for each fragment, a valid
span (two values, start/end) and then extend that if a range request has an 
overlap. You could even, in the disjoint case, compute which span had more data 
and keep that (e.g., effectively discard cached data if you can replace it with 
a larger span in the fragment).
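
The bookkeeping for that is small; a sketch (hypothetical types, not the
actual fragment header):

#include <algorithm>
#include <cstdint>

// Per-fragment valid span: bytes [start, end) are cached.
struct ValidSpan {
  int64_t start = 0;
  int64_t end   = 0; // start == end means nothing cached yet
};

// Fold a freshly fetched range [s, e) into the span. Overlapping or
// adjacent ranges extend it; in the disjoint case keep the larger span,
// effectively discarding the smaller piece of cached data.
void update_span(ValidSpan &v, int64_t s, int64_t e)
{
  if (v.start == v.end) {                  // empty: take the new range
    v = {s, e};
  } else if (s <= v.end && e >= v.start) { // overlap/adjacent: extend
    v.start = std::min(v.start, s);
    v.end   = std::max(v.end, e);
  } else if (e - s > v.end - v.start) {    // disjoint: keep the larger
    v = {s, e};
  }
}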

Sunday, September 9, 2012, 5:37:29 PM, you wrote:

 Thank you very much for the clarifications!

 I think now I have a better view on the current cache structure and your
 proposal.

  Who else should validate these changes before going to the next step, which
 I think would be the strategy of what and when to cache?

 PS: is there a branch already available with your change to move the
 fragment table from the first Doc to the alternate entries?



Re: Partial object caching

2012-09-10 Thread Bogdan Graur
  Your write-up seems basically accurate to me. I would note again that
 reading the earliest Doc of an alternate is an historical artifact of the
 implementation and not at all a functional requirement.


I see. I'm glad you cleared this up too. Initially I thought you planned
to store the fragment table in this earliest Doc. I checked the trunk and
saw it is in fact stored in the alternate entry itself (the HTTPCacheAlt
structure).

The other half of the proposal is that, when servicing a range request that
 has been forwarded to the origin server, we cache fill any complete chunks
 that are returned and ignore any partial chunks.

That's an elegant solution for keeping the chunk validity bitmap consistent!
Can we also consider the other way: make the request to the origin server
with a larger range so that we may 'join' two disjoint parts of the data?
(try to avoid having many empty chunks in between filled chunks)


 This is the point of having chunks, to make it much more likely that we
 can capture data from range requests. We could do that with just fragments
 but IMHO the granularity of that is too coarse. It might be interesting as
 a first step to try it to validate the partial write logic.

 There is also the issue of how much compile time and/or run time control
 should be provided over the granularity of the chunks. IMHO it would be
 reasonable to support compile time control over a value N such that each
 fragment consists of 2^N chunks.

I think this can be done.


 The fragment table changes are now on trunk, committed under the TS-1339
 bug ID.

Thanks for pointing that out.



  P.S. An alternative scheme would be to store, for each fragment, a
 valid span (two values, start/end) and then extend that if a range request
 has an overlap. You could even, in the disjoint case, compute which span
 had more data and keep that (e.g., effectively discard cached data if you
 can replace it with a larger span in the fragment).


The solution with the chunk validity bitmap has the advantage that it works
fine without throwing away cached data.
Does throwing away parts of a cached object while updating the cache for
the same object raise any synchronization issues?





 Sunday, September 9, 2012, 5:37:29 PM, you wrote:

  Thank you very much for the clarifications!

  I think now I have a better view on the current cache structure and your
  proposal.

   Who else should validate these changes before going to the next step,
   which I think would be the strategy of what and when to cache?

  PS: is there a branch already available with your change to move the
  fragment table from the first Doc to the alternate entries?




Re: Partial object caching

2012-09-10 Thread Alan M. Carroll
Monday, September 10, 2012, 5:14:40 PM, you wrote:

 Can we also consider the other way: make the request to the origin server
 with a larger range so that we may 'join' two disjoint parts of the data?
 (try to avoid having many empty chunks in between filled chunks)

We can consider it but I think it won't work. It would require, among other 
things, cleverness in changing the outgoing request *and* tweaking the response 
so that the client gets the originally requested range. Unlikely to be worth it 
- better to do something in the nature of the other bug you mentioned where, in 
a plugin, completely synthetic range requests are generated to fill gaps.


 The solution with the chunk validity bitmap has the advantage that it works
 fine without throwing away cached data.
 Does throwing away parts of a cached object while updating the cache for
 the same object raise any synchronization issues?

No, because you don't really throw away the data, you simply change the valid 
span values in the fragment. Reading and writing those have to be synchronized 
for other reasons so it's no additional cost.

Which is better may well depend on our anticipated access pattern. It's easy to
think of a plausible pattern where a client rolls through an object via range
requests (a very common pattern in practice) and each range request spans
chunks without ever covering a chunk, so you don't actually store anything from
the entire set of requests, even though ATS saw the entire object. For example,
with 16K chunks, a client reading 16K at a time starting 8K past a chunk
boundary covers two half chunks per request and never a whole chunk. The span
scheme, OTOH, would end up capturing the entire object.



Re: Partial object caching

2012-09-09 Thread Bogdan Graur
Thank you very much for the clarifications!

I think now I have a better view on the current cache structure and your
proposal.

Current flow for a range request:
1. Find the first Doc for the cached object.
2. The first Doc contains the alternate vector and the fragment table
(strictly, the fragment table belongs to the entries in the alternate
vector, and you said you will make this change soon)
3. Find the best matching alternate entry and use that as a representation
of the cached object in the next steps.
4. Read the Doc corresponding to the best matching alternate.
5. Compute the needed fragments for serving the requested range.
6. Read the needed fragments and serve the response.

Currently, if an object is cached, that means its whole data is available
from the cache.

With your proposal:
1. Find the first Doc for the cached object.
2. The first Doc contains the alternate vector.
3. Find the best matching alternate entry and use that as a representation
of the cached object in the next steps.
4. Read the Doc corresponding to the best matching alternate. [with your
change the fragment table will be available here, I presume]
5. Compute the needed fragments and verify that all data is available
(check corresponding chunk validity info that should be located in the
fragment table)
6. Read the needed fragments and serve the response if everything is
available.

Considering that your proposal fully supports partially cached objects
without any performance hit, my opinion is that it would be a good
representation for our needs.
The changes to the cache structure are also kept to a minimum.

Who else should validate these changes before going to the next step, which
I think would be the strategy of what and when to cache?

PS: is there a branch already available with your change to move the
fragment table from the first Doc to the alternate entries?

Best regards,
Bogdan Graur


On Thu, Sep 6, 2012 at 4:15 AM, Alan M. Carroll a...@network-geographics.com
 wrote:

 Wednesday, September 5, 2012, 4:16:42 PM, you wrote:

  Are all Doc structures of the same size?

 Mostly, but you don't need their actual size, only the size of the content
 per Doc instance, which is implicit in the fragment offset table. Given a
 range request we can walk the fragment offset table (which was read with
 the First/Earliest Doc) until we find the fragment which contains the
 starting byte of the range and begin loading fragments with that one.

  So the fragment table entry will be augmented with a new uint64_t field
  which holds the chunk validity information.

 Right.

  But after computing all the necessary Doc structures for satisfying a
  range request they still have to be read from disk (not the content, just
  the first part containing the fragment table) to check for chunk
 validity.

 No. The fragment offset table will be moved into the alternate header,
 which is stored in First Doc. Once that is read you have all of the
 fragment offset and chunk validity data in memory. The entire table, with
 all offsets and validity bitmaps, is a single chunk of data in the
 alternate header.

 Again, this is just a proposal to stimulate discussion, it is *not* a
 committed plan of action.

  Only when we know the complete range is valid may we serve the range
  from the cache.

 Yes, and that can be computed from in-memory data as soon as the alternate
 is selected.




Re: Partial object caching

2012-09-05 Thread Bogdan Graur

  1. Is there a one-to-many relation between a cached object and a Doc
  structure?

 Yes, that's the chained Docs in the lower right. Each Doc represents a
 fragment and we have discussed previously how objects are stored as an
 ordered set of fragments.


Just for clarification: I did some tests and it seems the entries in the
alternate vector stored in the first Doc are different versions of the
same cached object? (I see that these entries are scanned for the best
quality match, and the most recent entry with the highest score is used)
So actually AltVec is a vector of linked lists, each linked list being a
different version of a cached object?

As the cached object is represented with a linked list of fragments, does
that mean that in order to serve a specific range from it we have to read
all fragments (presumably from disk), following next_key, until we get to
the significant fragment? Or maybe again I'm missing something...


  2. A Doc structure contains a table of fragments which are represented as
  uint64_t offsets.

 The fragment offsets are the location in the object of the fragment data.
 frag_offset[i] is the address of the first byte past the end of fragment i.
 To simplify, presume fragments are at most 100 bytes long and we have an
 object that is 390 bytes long in four fragments (0,1,2,3). Then
 frag_offset[1] could be 201 which would mean the next byte past the end of
 fragment 1 is byte 201 of 390 in the original object (or equivalently that
 the first byte of fragment 2 is byte 201 of 390 in the object). This data
 is used to accelerate range requests so that if you had a request for bytes
 220-300 of the object you could skip immediately to fragment 2 without
 reading fragment 1. In real life there are some very large (20M+) files out
 there and being able to skip reading the first 10M or so from disk is an
 enormous performance improvement. Note the earliest doc for an alternate is
 always read because of the way the logic is done now, although in theory
  that could be skipped because the fragment data you need is in the
 First Doc which doesn't contain any object data for objects with multiple
 alternates or that are more than 1 fragment long.

From this description it seems to me that these fragments inside a
fragment resemble the chunks you describe in your original e-mail.
The differences are:
- currently these chunks may be of different sizes, as opposed to the
proposed new model of having equally sized chunks
- there is currently no bitfield to say which chunks are valid
Are these fragments inside a fragment currently used for anything other
than range request acceleration?

Best regards,
Bogdan Graur


Re: Partial object caching

2012-09-05 Thread Alan M. Carroll
Wednesday, September 5, 2012, 3:54:03 AM, you wrote:

 Just for clarification: I did some tests and it seems the entries in the
 alternate vector stored in the first Doc are different versions of the
 same cached object?

Yes, see here:
http://trafficserver.apache.org/docs/trunk/sdk/http-hooks-and-transactions/http-alternate-selection.en.html
 
 As the cached object is represented with a linked list of fragments does
 that mean that in order to serve a specific range from it we have to read
  all fragments (presumably from disk), following next_key, until we get to
  the significant fragment?

No. The keys are computationally linked. Given a key for a fragment, the key 
for the next fragment is computed, not read. Therefore you can compute the key 
for fragment i from the key for fragment 0 without I/O.
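
Purely as an illustration of that property (the actual ATS computation is
different), any deterministic next-key function lets you derive fragment i's
key from fragment 0's key with arithmetic alone:

#include <cstdint>

struct CacheKey {
  uint64_t hi, lo;
};

// Toy chaining step; ATS uses its own permutation of the key bits.
CacheKey next_key(const CacheKey &k)
{
  return {k.hi * 6364136223846793005ULL + k.lo,
          k.lo + 0x9E3779B97F4A7C15ULL};
}

CacheKey key_for_fragment(CacheKey k0, int i)
{
  for (int n = 0; n < i; ++n)
    k0 = next_key(k0); // pure computation, no disk I/O
  return k0;
}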

 From this description it seems to me that these fragments inside a
 fragment resemble the chunks you describe in your original e-mail.

No, the chunks in my original e-mail would be subranges of these fragments. 
Current fragments are effectively all the same size except the last fragment. 
Chunks are not currently used for anything as they do not exist in the 
current implementation. If implemented as suggested they would be more 
notional, effectively being stored only as the validity bitmap per fragment. 
Disk I/O would continue to be per fragment. Fragments exist to limit disk I/O 
per I/O operation. The range acceleration was bolted on by adding the fragment 
offset tables.

Hmmm, that would be a problem - how can you verify the entire range if it spans 
fragments, as there may be missing chunks in the middle? You don't want to
start serving and then discover a missing chunk. Ah, that's why the validity 
bitmaps would be stored in the fragment table, so all of that is available in 
memory before reading any fragments. 
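
A sketch of that check over the in-memory bitmaps (hypothetical layout;
uniform fragment sizes assumed for brevity):

#include <cstdint>
#include <vector>

constexpr int64_t FRAG_SIZE  = 1024 * 1024; // default fragment size
constexpr int64_t CHUNK_SIZE = FRAG_SIZE / 64;

// One validity bitmap per fragment, all read from the alternate header,
// so this runs before any fragment is fetched from disk. The caller
// guarantees [start, end) lies within the object.
bool range_fully_cached(const std::vector<uint64_t> &frag_bits,
                        int64_t start, int64_t end)
{
  for (int64_t b = start; b < end;) {
    int64_t frag  = b / FRAG_SIZE;
    int64_t chunk = (b % FRAG_SIZE) / CHUNK_SIZE;
    if (!(frag_bits[frag] & (uint64_t(1) << chunk)))
      return false; // a missing chunk anywhere means we cannot serve
    b = frag * FRAG_SIZE + (chunk + 1) * CHUNK_SIZE; // jump to next chunk
  }
  return true;
}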




Re: Partial object caching

2012-09-05 Thread Bogdan Graur


 No. The keys are computationally linked. Given a key for a fragment, the
 key for the next fragment is computed, not read.


OK, I think I found the way the next_key is computed from a given key.


 Therefore you can compute the key for fragment i from the key for fragment
 0 without I/O.


Are all Doc structures of the same size? If they differ in size, and we do
not have each of their sizes somewhere in RAM (or in the first fragment),
how can we get away without any I/O for the non-relevant Docs when serving
a range request?


 Hmmm, that would be a problem - how can you verify the entire range if it
 spans fragments, as there may by be missing chunks in the middle? You don't
 want to start serving and then discover a missing chunk. Ah, that's why the
 validity bitmaps would be stored in the fragment table, so all of that is
 available in memory before reading any fragments.


So the fragment table entry will be augmented with a new uint64_t field
which holds the chunk validity information.
But after computing all the necessary Doc structures for satisfying a
range request, they still have to be read from disk (not the content, just
the first part containing the fragment table) to check for chunk validity.
Only when we know the complete range is valid may we serve the range
from the cache.

Thank you very much for pointing me to the Traffic Server Programming
Guide!


Re: Partial object caching

2012-09-05 Thread Alan M. Carroll
Wednesday, September 5, 2012, 4:16:42 PM, you wrote:

 Are all Doc structures of the same size?

Mostly, but you don't need their actual size, only the size of the content per 
Doc instance, which is implicit in the fragment offset table. Given a range 
request we can walk the fragment offset table (which was read with the 
First/Earliest Doc) until we find the fragment which contains the starting byte 
of the range and begin loading fragments with that one.
 
 So the fragment table entry will be augmented with a new uint64_t field
 which holds the chunk validity information.

Right.

 But after computing all the necessary Doc structures for satisfying a
 range request they still have to be read from disk (not the content, just
 the first part containing the fragment table) to check for chunk validity.

No. The fragment offset table will be moved into the alternate header, which
is stored in First Doc. Once that is read you have all of the fragment offset 
and chunk validity data in memory. The entire table, with all offsets and 
validity bitmaps, is a single chunk of data in the alternate header.
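
For concreteness, the layout could be as simple as this (a sketch, not the
actual HTTPCacheAlt definition):

#include <cstdint>

struct FragEntry {
  uint64_t offset;     // first byte past the end of this fragment
  uint64_t chunk_bits; // chunk validity bitmap for this fragment
};
// The alternate header would hold one contiguous FragEntry array.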

Again, this is just a proposal to stimulate discussion, it is *not* a committed 
plan of action.

 Only when we know the complete range is valid may we serve the range
 from the cache.

Yes, and that can be computed from in-memory data as soon as the alternate is
selected.



Re: Partial object caching

2012-08-30 Thread Bogdan Graur
Thank you for taking the time to draw that diagram!

Looking at this diagram and the code some things are not clear for me:

1. Is there a one-to-many relation between a cached object and a Doc
structure? In other words: do many Doc structures hold the data for a cached
object? The diagram says that only the first Doc in the chain will contain
the alternate information vector. This suggests there are many Doc
structures associated with a cached object.

2. A Doc structure contains a table of fragments which are represented as
uint64_t offsets. I understand that you plan to move this table to the
alternate information. But it is not clear to me where these offsets
currently point. Where is the data at offset zero held?

Best regards,
Bogdan Graur


On Wed, Aug 29, 2012 at 2:58 PM, Alan M. Carroll 
a...@network-geographics.com wrote:

 Kinda sorta. I have made a simple diagram with some of this information:

 http://people.apache.org/~amc/ats-cache-layout.jpg

 which you might find useful. The proposed change would be quite similar to
 the existing cache structure but would have a couple of major differences -

 1) Support having fallow fragments. The current logic requires all
 fragments to be present in the cache and cannot detect or handle missing
 fragments.

  2) Subdividing each fragment into chunks. All of that would be new.

  This would also require moving the fragment offset data into the
 alternate information but I already have a patch for that which will be
 landing on trunk Real Soon Now.

 Tuesday, August 28, 2012, 1:14:48 PM, you wrote:

  This model of representing a cached object indeed allows holding only
  partial data for the resource.

  Is this the way a cached object is currently represented in TS?




Re: Partial object caching

2012-08-30 Thread Alan M. Carroll
Thursday, August 30, 2012, 5:29:16 PM, you wrote:

 1. Is there a one-to-many relation between a cached object and a Doc
 structure?

Yes, that's the chained Docs in the lower right. Each Doc represents a fragment 
and we have discussed previously how objects are stored as an ordered set of 
fragments.

 2. A Doc structure contains a table of fragments which are represented as
 uint64_t offsets.

The fragment offsets are the location in the object of the fragment data. 
frag_offset[i] is the address of the first byte past the end of fragment i. To 
simplify, presume fragments are at most 100 bytes long and we have an object 
that is 390 bytes long in four fragments (0,1,2,3). Then frag_offset[1] could 
be 201 which would mean the next byte past the end of fragment 1 is byte 201 of 
390 in the original object (or equivalently that the first byte of fragment 2 
is byte 201 of 390 in the object). This data is used to accelerate range 
requests so that if you had a request for bytes 220-300 of the object you could 
skip immediately to fragment 2 without reading fragment 1. In real life there 
are some very large (20M+) files out there and being able to skip reading the 
first 10M or so from disk is an enormous performance improvement. Note the 
earliest doc for an alternate is always read because of the way the logic is 
done now, although in theory that could be skipped because the fragment
data you need is in the First Doc which doesn't contain any object data for 
objects with multiple alternates or that are more than 1 fragment long.



Re: Partial object caching

2012-08-29 Thread Alan M. Carroll
Kinda sorta. I have made a simple diagram with some of this information:

http://people.apache.org/~amc/ats-cache-layout.jpg

which you might find useful. The proposed change would be quite similar to the 
existing cache structure but would have a couple of major differences -

1) Support having fallow fragments. The current logic requires all fragments 
to be present in the cache and cannot detect or handle missing fragments.

2) Subdividing each fragment into chunks. All of that would be new.

This would also require moving the fragment offset data into the alternate
information but I already have a patch for that which will be landing on trunk 
Real Soon Now.

Tuesday, August 28, 2012, 1:14:48 PM, you wrote:

 This model of representing a cached object indeed allows holding only
 partial data for the resource.

 Is this the way a cached object is currently represented in TS?



Re: Partial object caching

2012-08-28 Thread Bogdan Graur
This model of representing a cached object indeed allows holding only
partial data for the resource.

Is this the way a cached object is currently represented in TS?



On Fri, Aug 24, 2012 at 2:19 PM, Alan M. Carroll 
a...@network-geographics.com wrote:

 Thursday, August 23, 2012, 9:00:29 PM, you wrote:

  If 1 of N fragments is cached, can we currently serve ranges out of that?

 No.




Re: Partial object caching

2012-08-23 Thread James Peach
On 23/08/2012, at 5:23 PM, Alan M. Carroll a...@network-geographics.com wrote:

 Since this has been a topic for a while, I will just throw out an idea to see 
 how fast you guys can shoot it down.
 
 A cache object is stored as a series of fragments. If we subdivided each 
 fragment into chunks, we could have 64 chunks per fragment and represent
 them with a bitmap in a single uint64_t. A set bit would indicate valid data 
 in that chunk. Partial content would be written only in chunk units and only 
 for chunks that are complete in the data. For the default size of 1M 
 fragments, each chunk would be 16K which seems a reasonable value. The 
 bitmaps would be stored along with the fragment offset table in the alternate 
 info header. This would keep it out of the directory while making it 
 available when serving because the alternate data is loaded before that 
 point. Range validity checks could also be done without additional disk I/O 
 because you can't detect if a range is valid for an object before the 
 alternate is determined. We could only serve if the request range was 
 completely covered, or generate a synthetic range request to cover parts that 
 were not already in the cache.
 
 This would mean that files less than one fragment would not have partial 
 content cached but I think that's acceptable as the advantages of partial 
 caching are only for larger objects.

If 1 of N fragments is cached, can we currently serve ranges out of that?

J