Re: Use of posix_fadvise

2016-10-18 Thread Nate McCall
Benedict mentioned it briefly above, but the earliest / most detailed
conversation on this can be found in CASSANDRA-1470.

You may also find some stuff from the dev list and other issues around
the same time from specifically from Chris G. and Peter S. (names on
that ticket) as, IIRC, they were both doing a lot of experimentation
at the time tuning Cassandra IO for their hardware platforms. Both
those guys were knowledgeable/approachable - pretty sure they wouldn't
mind getting pinged directly for something interesting like this via
the twitters or some such.

-N

On Wed, Oct 19, 2016 at 6:28 AM, Michael Kjellman
 wrote:
> Sorry, No. Always document your assumptions. I shouldn't need to git blame a 
> thousand commits and read thru a billion tickets to maybe understand why 
> something was done. Clearly thru the conversations on this topic I've had on 
> IRC and the responses so far on this email thread it's not/still not obvious.
>
> best,
> kjellman
>
> On Oct 18, 2016, at 10:07 AM, Benedict Elliott Smith 
> > wrote:
>
> This is what JIRA is for.
>


Re: Use of posix_fadvise

2016-10-18 Thread Benedict Elliott Smith
I'm not certain this is the best way to go about encouraging people to help
you, or generally encourage participation in the project.  You have seemed
to lash out at the project (and in this case me specifically) in a fairly
antagonistic manner a multitude of times in just a couple of hours.

Your original question, on zero, predates anything I know about.  JIRA is
your best bet, and provides historical context that is never going to live
in comments.  I did not imply that the comments were adequate, only that *this
is where you should probably look to answer your question.  *Comment policy
and norms have changed a lot throughout Cassandra's history, and you're
asking about a time that predates the current level of maturity, but JIRA
has always been (AFAIK) the main source of historical context.  I attempted
to provide some links into this to save you from the "billion" (handful) of
tickets.

I don't have time for another flamewar, so I will leave out trying to
assist you in future.




On 18 October 2016 at 18:28, Michael Kjellman 
wrote:

> Sorry, No. Always document your assumptions. I shouldn't need to git blame
> a thousand commits and read thru a billion tickets to maybe understand why
> something was done. Clearly thru the conversations on this topic I've had
> on IRC and the responses so far on this email thread it's not/still not
> obvious.
>
> best,
> kjellman
>
> On Oct 18, 2016, at 10:07 AM, Benedict Elliott Smith  > wrote:
>
> This is what JIRA is for.
>
>


Re: Use of posix_fadvise

2016-10-18 Thread Michael Kjellman
Sorry, No. Always document your assumptions. I shouldn't need to git blame a 
thousand commits and read thru a billion tickets to maybe understand why 
something was done. Clearly thru the conversations on this topic I've had on 
IRC and the responses so far on this email thread it's not/still not obvious.

best,
kjellman

On Oct 18, 2016, at 10:07 AM, Benedict Elliott Smith 
> wrote:

This is what JIRA is for.



Re: Use of posix_fadvise

2016-10-18 Thread Benedict Elliott Smith
This is what JIRA is for.  It seems to date back to CASSANDRA-1470, where
the default became immediately evicting newly compacted files.

This results in cold reads for *hot* data after compaction, so
CASSANDRA-6916 permitted evicting the *old* data instead, while
guaranteeing >= the same amount of eviction.

Whether or not the original issue of cold compaction data was a pain point,
I cannot attest, but I was assured (by whom, I do not recall) that it was.
In its present form it is at least not harmful.  It was (and is) not a
no-op:

http://riptano.github.io/cassandra_performance/graph_v3/graph.html?stats=stats.6916v3-preempive-open-compact.mixed.2.json=op_rate=mixed=1_aggregates=true=0=545.6=0=114638.7




On 18 October 2016 at 17:42, Michael Kjellman 
wrote:

> Yeah, it has been there for years -- that being said most of the community
> is just catching up to 2.1 and 3.0 now where the usage did appear to change
> over 2.0-- and I'm more trying to figure out what the intent was in the
> various usages all over the codebase and make sure it's actually doing
> that. Maybe even add some comments about that intent. :)
>
> In 2.1 I saw that we were doing this to get the file descriptor in some
> cases (which obviously will return the wrong file descriptor so most likely
> would have made this even more of a potential no-op than it already was?):
>
> public static int getfd(String path)
> {
> RandomAccessFile file = null;
> try
> {
> file = new RandomAccessFile(path, "r");
> return getfd(file.getFD());
> }
> catch (Throwable t)
> {
> JVMStabilityInspector.inspectThrowable(t);
> // ignore
> return -1;
> }
> finally
> {
> try
> {
> if (file != null)
> file.close();
> }
> catch (Throwable t)
> {
> // ignore
> }
> }
> }
>
>
> On Oct 18, 2016, at 9:34 AM, Jake Luciani  s...@gmail.com>> wrote:
>
> Although given we have an in process page cache[1] now this may not be
> needed anymore?
> This is only for the data file though.  I think its been years? since we
> showed it helped so perhaps someone should show if this is still
> working/helping in the real world.
>
> [1] https://issues.apache.org/jira/browse/CASSANDRA-5863
>
>
> On Tue, Oct 18, 2016 at 11:59 AM, Michael Kjellman <
> mkjell...@internalcircle.com> wrote:
>
> Specifically regarding the behavior in different kernels, from `man
> posix_fadvise`: "In kernels before 2.6.6, if len was specified as 0, then
> this was interpreted literally as "zero bytes", rather than as meaning "all
> bytes through to the end of the file"."
>
> On Oct 18, 2016, at 8:57 AM, Michael Kjellman <
> mkjell...@internalcircle.com mkjell...@internalcircle.com>> wrote:
>
> Right, so in SSTableReader#GlobalTidy$tidy it does:
> // don't ideally want to dropPageCache for the file until all instances
> have been released
> CLibrary.trySkipCache(desc.filenameFor(Component.DATA), 0, 0);
> CLibrary.trySkipCache(desc.filenameFor(Component.PRIMARY_INDEX), 0, 0);
>
> It seems to me every time the reference is released on a new sstable we
> would immediately tidy() it and then call posix_fadvise with
> POSIX_FADV_DONTNEED with an offset of 0 and a length of 0 (which I'm
> thinking is doing so in respect to the API behavior in modern Linux kernel
> builds?). Am I reading things correctly here? Sorta hard as there are many
> different code paths the reference could have tidy() called.
>
> Why would we want to drop the segment we just write from the page cache --
> wouldn't that most likely be the most hot data, and even if it turned out
> not to be wouldn't it be better in this case to have kernel be smart at
> what it's best at?
>
> best,
> kjellman
>
> On Oct 18, 2016, at 8:50 AM, Jake Luciani  s...@gmail.com> s...@gmail.com>> wrote:
>
> The main point is to avoid keeping things in the page cache that are no
> longer needed like compacted data that has been early opened elsewhere.
>
> On Oct 18, 2016 11:29 AM, "Michael Kjellman"  
> >
> wrote:
>
> We use posix_fadvise in a bunch of places, and in stereotypical Cassandra
> fashion no comments were provided.
>
> There is a check the OS is Linux (okay, a start) but it turns out the
> behavior of providing a length of 0 to posix_fadvise changed in some 2.6
> kernels. We don't check the kernel version -- or even note it.
>
> What is the *expected* outcome of our use of posix_fadvise -- not what
> does it do or not do today -- but what problem was it added to solve and
> what's the expected behavior regardless of kernel versions.
>
> best,
> kjellman
>
> Sent from my iPhone
>
>
>
>
>

Re: Use of posix_fadvise

2016-10-18 Thread Michael Kjellman
Yeah, it has been there for years -- that being said most of the community is 
just catching up to 2.1 and 3.0 now where the usage did appear to change over 
2.0-- and I'm more trying to figure out what the intent was in the various 
usages all over the codebase and make sure it's actually doing that. Maybe even 
add some comments about that intent. :)

In 2.1 I saw that we were doing this to get the file descriptor in some cases 
(which obviously will return the wrong file descriptor so most likely would 
have made this even more of a potential no-op than it already was?):

public static int getfd(String path)
{
RandomAccessFile file = null;
try
{
file = new RandomAccessFile(path, "r");
return getfd(file.getFD());
}
catch (Throwable t)
{
JVMStabilityInspector.inspectThrowable(t);
// ignore
return -1;
}
finally
{
try
{
if (file != null)
file.close();
}
catch (Throwable t)
{
// ignore
}
}
}


On Oct 18, 2016, at 9:34 AM, Jake Luciani 
> wrote:

Although given we have an in process page cache[1] now this may not be
needed anymore?
This is only for the data file though.  I think its been years? since we
showed it helped so perhaps someone should show if this is still
working/helping in the real world.

[1] https://issues.apache.org/jira/browse/CASSANDRA-5863


On Tue, Oct 18, 2016 at 11:59 AM, Michael Kjellman <
mkjell...@internalcircle.com> wrote:

Specifically regarding the behavior in different kernels, from `man
posix_fadvise`: "In kernels before 2.6.6, if len was specified as 0, then
this was interpreted literally as "zero bytes", rather than as meaning "all
bytes through to the end of the file"."

On Oct 18, 2016, at 8:57 AM, Michael Kjellman <
mkjell...@internalcircle.com>
 wrote:

Right, so in SSTableReader#GlobalTidy$tidy it does:
// don't ideally want to dropPageCache for the file until all instances
have been released
CLibrary.trySkipCache(desc.filenameFor(Component.DATA), 0, 0);
CLibrary.trySkipCache(desc.filenameFor(Component.PRIMARY_INDEX), 0, 0);

It seems to me every time the reference is released on a new sstable we
would immediately tidy() it and then call posix_fadvise with
POSIX_FADV_DONTNEED with an offset of 0 and a length of 0 (which I'm
thinking is doing so in respect to the API behavior in modern Linux kernel
builds?). Am I reading things correctly here? Sorta hard as there are many
different code paths the reference could have tidy() called.

Why would we want to drop the segment we just write from the page cache --
wouldn't that most likely be the most hot data, and even if it turned out
not to be wouldn't it be better in this case to have kernel be smart at
what it's best at?

best,
kjellman

On Oct 18, 2016, at 8:50 AM, Jake Luciani 
>> wrote:

The main point is to avoid keeping things in the page cache that are no
longer needed like compacted data that has been early opened elsewhere.

On Oct 18, 2016 11:29 AM, "Michael Kjellman" 

>
wrote:

We use posix_fadvise in a bunch of places, and in stereotypical Cassandra
fashion no comments were provided.

There is a check the OS is Linux (okay, a start) but it turns out the
behavior of providing a length of 0 to posix_fadvise changed in some 2.6
kernels. We don't check the kernel version -- or even note it.

What is the *expected* outcome of our use of posix_fadvise -- not what
does it do or not do today -- but what problem was it added to solve and
what's the expected behavior regardless of kernel versions.

best,
kjellman

Sent from my iPhone





--
http://twitter.com/tjake



Re: Use of posix_fadvise

2016-10-18 Thread Jake Luciani
Although given we have an in process page cache[1] now this may not be
needed anymore?
This is only for the data file though.  I think its been years? since we
showed it helped so perhaps someone should show if this is still
working/helping in the real world.

[1] https://issues.apache.org/jira/browse/CASSANDRA-5863


On Tue, Oct 18, 2016 at 11:59 AM, Michael Kjellman <
mkjell...@internalcircle.com> wrote:

> Specifically regarding the behavior in different kernels, from `man
> posix_fadvise`: "In kernels before 2.6.6, if len was specified as 0, then
> this was interpreted literally as "zero bytes", rather than as meaning "all
> bytes through to the end of the file"."
>
> On Oct 18, 2016, at 8:57 AM, Michael Kjellman <
> mkjell...@internalcircle.com> wrote:
>
> Right, so in SSTableReader#GlobalTidy$tidy it does:
> // don't ideally want to dropPageCache for the file until all instances
> have been released
> CLibrary.trySkipCache(desc.filenameFor(Component.DATA), 0, 0);
> CLibrary.trySkipCache(desc.filenameFor(Component.PRIMARY_INDEX), 0, 0);
>
> It seems to me every time the reference is released on a new sstable we
> would immediately tidy() it and then call posix_fadvise with
> POSIX_FADV_DONTNEED with an offset of 0 and a length of 0 (which I'm
> thinking is doing so in respect to the API behavior in modern Linux kernel
> builds?). Am I reading things correctly here? Sorta hard as there are many
> different code paths the reference could have tidy() called.
>
> Why would we want to drop the segment we just write from the page cache --
> wouldn't that most likely be the most hot data, and even if it turned out
> not to be wouldn't it be better in this case to have kernel be smart at
> what it's best at?
>
> best,
> kjellman
>
> On Oct 18, 2016, at 8:50 AM, Jake Luciani  s...@gmail.com>> wrote:
>
> The main point is to avoid keeping things in the page cache that are no
> longer needed like compacted data that has been early opened elsewhere.
>
> On Oct 18, 2016 11:29 AM, "Michael Kjellman"  >
> wrote:
>
> We use posix_fadvise in a bunch of places, and in stereotypical Cassandra
> fashion no comments were provided.
>
> There is a check the OS is Linux (okay, a start) but it turns out the
> behavior of providing a length of 0 to posix_fadvise changed in some 2.6
> kernels. We don't check the kernel version -- or even note it.
>
> What is the *expected* outcome of our use of posix_fadvise -- not what
> does it do or not do today -- but what problem was it added to solve and
> what's the expected behavior regardless of kernel versions.
>
> best,
> kjellman
>
> Sent from my iPhone
>
>
>


-- 
http://twitter.com/tjake


Re: Use of posix_fadvise

2016-10-18 Thread Ariel Weisberg
Hi,

Compaction can merge some very large files together with data that may
be completely cold. So yeah caching the whole file just creates pressure
to evict useful stuff. In some theories.

In other theories the page cache is flush and scan resistant and should
just eat this stuff up without intervention. Sure it might hurt a bit,
but it's a bounded amount before the cache stops discarding useful stuff
in favor of new stuff that is unproven.

If there is a benchmark with this enabled/disabled I haven't seen it.
Doesn't mean it doesn't exist though.

Ariel
On Tue, Oct 18, 2016, at 12:05 PM, Michael Kjellman wrote:
> Within a single SegmentedFile?
> 
> On Oct 18, 2016, at 9:02 AM, Ariel Weisberg
> > wrote:
> 
> With compaction there can be hot and cold data mixed together.
> 


Re: Use of posix_fadvise

2016-10-18 Thread Michael Kjellman
Within a single SegmentedFile?

On Oct 18, 2016, at 9:02 AM, Ariel Weisberg 
> wrote:

With compaction there can be hot and cold data mixed together.



Re: Use of posix_fadvise

2016-10-18 Thread Ariel Weisberg
Hi,

With compaction there can be hot and cold data mixed together. So we want
to drop the data and then warm it via early opening so only the hot data is
in the cache.

Some of those cases are for the old sstable that have been rewritten or
discarded so the data is entirely defunct. The files might not get deleted
though so they do add pressure to the cache until they are evicted.

In the instance you are looking at in a tidier won't there always be a
reference held in the current view for the column family? It don't think it
would constantly be evicting them nor closing/reopening and remapping the
file.


Specifically regarding the behavior in different kernels, from `man
> posix_fadvise`: "In kernels before 2.6.6, if len was specified as 0, then
> this was interpreted literally as "zero bytes", rather than as meaning "all
> bytes through to the end of the file"."

Not ideal, but at least not actively harmful right? The cache is supposed
to be scan/flush resistant.

Ariel

On Tue, Oct 18, 2016 at 11:57 AM, Michael Kjellman <
mkjell...@internalcircle.com> wrote:

> Right, so in SSTableReader#GlobalTidy$tidy it does:
> // don't ideally want to dropPageCache for the file until all instances
> have been released
> CLibrary.trySkipCache(desc.filenameFor(Component.DATA), 0, 0);
> CLibrary.trySkipCache(desc.filenameFor(Component.PRIMARY_INDEX), 0, 0);
>
> It seems to me every time the reference is released on a new sstable we
> would immediately tidy() it and then call posix_fadvise with
> POSIX_FADV_DONTNEED with an offset of 0 and a length of 0 (which I'm
> thinking is doing so in respect to the API behavior in modern Linux kernel
> builds?). Am I reading things correctly here? Sorta hard as there are many
> different code paths the reference could have tidy() called.
>
> Why would we want to drop the segment we just write from the page cache --
> wouldn't that most likely be the most hot data, and even if it turned out
> not to be wouldn't it be better in this case to have kernel be smart at
> what it's best at?
>
> best,
> kjellman
>
> > On Oct 18, 2016, at 8:50 AM, Jake Luciani  wrote:
> >
> > The main point is to avoid keeping things in the page cache that are no
> > longer needed like compacted data that has been early opened elsewhere.
> >
> > On Oct 18, 2016 11:29 AM, "Michael Kjellman" <
> mkjell...@internalcircle.com>
> > wrote:
> >
> >> We use posix_fadvise in a bunch of places, and in stereotypical
> Cassandra
> >> fashion no comments were provided.
> >>
> >> There is a check the OS is Linux (okay, a start) but it turns out the
> >> behavior of providing a length of 0 to posix_fadvise changed in some 2.6
> >> kernels. We don't check the kernel version -- or even note it.
> >>
> >> What is the *expected* outcome of our use of posix_fadvise -- not what
> >> does it do or not do today -- but what problem was it added to solve and
> >> what's the expected behavior regardless of kernel versions.
> >>
> >> best,
> >> kjellman
> >>
> >> Sent from my iPhone
>
>


Re: Use of posix_fadvise

2016-10-18 Thread Michael Kjellman
Right, so in SSTableReader#GlobalTidy$tidy it does:
// don't ideally want to dropPageCache for the file until all instances have 
been released
CLibrary.trySkipCache(desc.filenameFor(Component.DATA), 0, 0);
CLibrary.trySkipCache(desc.filenameFor(Component.PRIMARY_INDEX), 0, 0);

It seems to me every time the reference is released on a new sstable we would 
immediately tidy() it and then call posix_fadvise with POSIX_FADV_DONTNEED with 
an offset of 0 and a length of 0 (which I'm thinking is doing so in respect to 
the API behavior in modern Linux kernel builds?). Am I reading things correctly 
here? Sorta hard as there are many different code paths the reference could 
have tidy() called.

Why would we want to drop the segment we just write from the page cache -- 
wouldn't that most likely be the most hot data, and even if it turned out not 
to be wouldn't it be better in this case to have kernel be smart at what it's 
best at?

best,
kjellman

> On Oct 18, 2016, at 8:50 AM, Jake Luciani  wrote:
> 
> The main point is to avoid keeping things in the page cache that are no
> longer needed like compacted data that has been early opened elsewhere.
> 
> On Oct 18, 2016 11:29 AM, "Michael Kjellman" 
> wrote:
> 
>> We use posix_fadvise in a bunch of places, and in stereotypical Cassandra
>> fashion no comments were provided.
>> 
>> There is a check the OS is Linux (okay, a start) but it turns out the
>> behavior of providing a length of 0 to posix_fadvise changed in some 2.6
>> kernels. We don't check the kernel version -- or even note it.
>> 
>> What is the *expected* outcome of our use of posix_fadvise -- not what
>> does it do or not do today -- but what problem was it added to solve and
>> what's the expected behavior regardless of kernel versions.
>> 
>> best,
>> kjellman
>> 
>> Sent from my iPhone



Re: Use of posix_fadvise

2016-10-18 Thread Jake Luciani
The main point is to avoid keeping things in the page cache that are no
longer needed like compacted data that has been early opened elsewhere.

On Oct 18, 2016 11:29 AM, "Michael Kjellman" 
wrote:

> We use posix_fadvise in a bunch of places, and in stereotypical Cassandra
> fashion no comments were provided.
>
> There is a check the OS is Linux (okay, a start) but it turns out the
> behavior of providing a length of 0 to posix_fadvise changed in some 2.6
> kernels. We don't check the kernel version -- or even note it.
>
> What is the *expected* outcome of our use of posix_fadvise -- not what
> does it do or not do today -- but what problem was it added to solve and
> what's the expected behavior regardless of kernel versions.
>
> best,
> kjellman
>
> Sent from my iPhone


Re: Use of posix_fadvise

2016-10-18 Thread Michael Kjellman
Sure -- my bad, I aggregated them all of them up for you:
https://github.com/apache/cassandra/search?utf8=✓=CLibrary.trySkipCache=Code
https://github.com/apache/cassandra/blob/81f6c784ce967fadb6ed7f58de1328e713eaf53c/test/unit/org/apache/cassandra/utils/CLibraryTest.java#L34
https://github.com/apache/cassandra/blob/81f6c784ce967fadb6ed7f58de1328e713eaf53c/src/java/org/apache/cassandra/db/commitlog/MemoryMappedSegment.java#L102
https://github.com/apache/cassandra/blob/81f6c784ce967fadb6ed7f58de1328e713eaf53c/src/java/org/apache/cassandra/hints/ChecksummedDataInput.java#L218
https://github.com/apache/cassandra/blob/81f6c784ce967fadb6ed7f58de1328e713eaf53c/src/java/org/apache/cassandra/hints/HintsWriter.java#L292
https://github.com/apache/cassandra/blob/81f6c784ce967fadb6ed7f58de1328e713eaf53c/src/java/org/apache/cassandra/io/util/FileHandle.java#L167
https://github.com/apache/cassandra/blob/81f6c784ce967fadb6ed7f58de1328e713eaf53c/src/java/org/apache/cassandra/io/sstable/SSTableRewriter.java#L174
https://github.com/apache/cassandra/blob/f2a354763877cfeaf1dd017b84a7c8ee9eafd885/src/java/org/apache/cassandra/io/sstable/format/SSTableReader.java#L2281
https://github.com/apache/cassandra/blob/f2a354763877cfeaf1dd017b84a7c8ee9eafd885/src/java/org/apache/cassandra/io/sstable/format/SSTableReader.java#L2282

Or if you use IDEA this should work pretty well too:
[cid:543B66BF-5E99-4227-A24D-1AB8C0341D97@localhost]

best,
kjellman


On Oct 18, 2016, at 8:33 AM, Benedict Elliott Smith 
> wrote:

... and continuing in the fashion of behaviours one might like to disabuse
people of, no code link is provided.



On 18 October 2016 at 16:28, Michael Kjellman 
>
wrote:

We use posix_fadvise in a bunch of places, and in stereotypical Cassandra
fashion no comments were provided.

There is a check the OS is Linux (okay, a start) but it turns out the
behavior of providing a length of 0 to posix_fadvise changed in some 2.6
kernels. We don't check the kernel version -- or even note it.

What is the *expected* outcome of our use of posix_fadvise -- not what
does it do or not do today -- but what problem was it added to solve and
what's the expected behavior regardless of kernel versions.

best,
kjellman

Sent from my iPhone



Re: Use of posix_fadvise

2016-10-18 Thread Benedict Elliott Smith
... and continuing in the fashion of behaviours one might like to disabuse
people of, no code link is provided.



On 18 October 2016 at 16:28, Michael Kjellman 
wrote:

> We use posix_fadvise in a bunch of places, and in stereotypical Cassandra
> fashion no comments were provided.
>
> There is a check the OS is Linux (okay, a start) but it turns out the
> behavior of providing a length of 0 to posix_fadvise changed in some 2.6
> kernels. We don't check the kernel version -- or even note it.
>
> What is the *expected* outcome of our use of posix_fadvise -- not what
> does it do or not do today -- but what problem was it added to solve and
> what's the expected behavior regardless of kernel versions.
>
> best,
> kjellman
>
> Sent from my iPhone