Re: [RFC] Heads up on sys_fallocate()

2007-03-13 Thread David Chinner
On Tue, Mar 06, 2007 at 10:46:56AM -0600, Eric Sandeen wrote:
> Ulrich Drepper wrote:
> > Christoph Hellwig wrote:
> >> fallocate with the whence argument and flags is already quite complicated,
> >> I'd rather have another call for placement decisions, that would
> >> be called on an fd to do placement decisions for any further allocations
> >> (prealloc, write, etc)
> > 
> > Yes, posix_fallocate shouldn't be made more complicated.  But I don't
> > understand why requesting linear layout of the blocks should be an
> > option.  It's always an advantage if the blocks requested this way are
> > linear on disk.  So, the kernel should always do its best to make this
> > happen, without needing an additional option.
> > 
> 
> Agreed on both points.  The hints would be for things like start block,
> or speculative EOF preallocation, not contiguity, which I think should
> always be the goal.

ISTR having had this discussion before ;)

About guided preallocation for defrag:

http://marc.info/?t=11624785951&r=1&w=2

e.g.: The sorts of policies we need for effective use of
preallocation:

http://marc.info/?l=linux-fsdevel&m=116184475308164&w=2
http://marc.info/?l=linux-fsdevel&m=116278169519095&w=2

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] Heads up on sys_fallocate()

2007-03-07 Thread Jörn Engel
On Wed, 7 March 2007 09:51:35 +0100, Jan Kara wrote:
>
>   I'll probably first write some userspace fs-reorganizer to find out how
> much these changes in layout are able to give you in performance (i.e.
> whether it's worth the effort of more complicated kernel online
> defragmenter).

Have you tried profiling the read accesses and prereading them
asynchronously on startup?  That appears to have improved E17 a lot.
See http://lca2007.linux.org.au/talk/101 (and watch the video).

Jörn

-- 
The competent programmer is fully aware of the strictly limited size of
his own skull; therefore he approaches the programming task in full
humility, and among other things he avoids clever tricks like the plague.
-- Edsger W. Dijkstra
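
As an illustration of the prereading idea in this message, here is a minimal
sketch in C. It assumes the list of files comes from an earlier profiling run;
the function name preread_files() and that list are hypothetical and not taken
from the thread or from the E17 work mentioned above. It only uses
posix_fadvise() with POSIX_FADV_WILLNEED, which asks the kernel to start
reading the file without blocking the caller.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Ask the kernel to start pulling each listed file into the page cache.
 * `paths` is assumed to have been recorded by a profiling run (hypothetical).
 * Returns the number of files that could be opened. The hint itself is
 * advisory, so its return value is deliberately ignored.
 */
size_t preread_files(const char *const *paths, size_t count)
{
    size_t opened = 0;

    assert(paths != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        int fd = open(paths[i], O_RDONLY);
        if (fd < 0)
            continue;   /* file vanished since the profile was taken */
        /* offset 0 with len 0 means "from the start to the end of file" */
        (void)posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
        close(fd);
        opened++;
    }
    return opened;
}
```

Whether this helps depends on how well the recorded list matches what the
application really reads at startup; it does nothing about on-disk layout.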


Re: [RFC] Heads up on sys_fallocate()

2007-03-07 Thread Jan Kara
On Tue 06-03-07 12:23:22, Eric Sandeen wrote:
> Jan Kara wrote:
> > On Tue 06-03-07 06:36:09, Ulrich Drepper wrote:
> >> Christoph Hellwig wrote:
> >>> fallocate with the whence argument and flags is already quite complicated,
> >>> I'd rather have another call for placement decisions, that would
> >>> be called on an fd to do placement decisions for any further allocations
> >>> (prealloc, write, etc)
> >> Yes, posix_fallocate shouldn't be made more complicated.  But I don't
> >> understand why requesting linear layout of the blocks should be an
> >> option.  It's always an advantage if the blocks requested this way are
> >> linear on disk.  So, the kernel should always do its best to make this
> >> happen, without needing an additional option.
> >   Actually, it's not that simple. You want linear layout of blocks you are
> > going to read. That is not necessarily a linear layout of blocks in a single
> > file - trace sometime a start of some complicated app like KDE. You find
> > it's seeking like a hell because it needs a few blocks from a ton of
> > distinct files (shared libs, config files, etc). As these files are mostly
> > read only, it's advantageous to interleave them on disk or at least keep
> > them close.
> 
> At some point shouldn't the apps be fixed, rather than do crazy things
> with the filesystem?  :)
  Yes :) That's basically what we told KDE developers when they were
complaining ;) But it's hard to fix it for them too (because of some
desktop specs requiring lots of different text config files which can
change anytime - don't ask me who designed it). Moreover for example for
loading shared libraries from which you need just a few blocks scattered
all over the place the problem is in ELF itself.
  I'll probably first write some userspace fs-reorganizer to find out how
much these changes in layout are able to give you in performance (i.e.
whether it's worth the effort of more complicated kernel online
defragmenter).

Honza
-- 
Jan Kara <[EMAIL PROTECTED]>
SuSE CR Labs


Re: [RFC] Heads up on sys_fallocate()

2007-03-06 Thread Eric Sandeen
Jan Kara wrote:
> On Tue 06-03-07 06:36:09, Ulrich Drepper wrote:
>> Christoph Hellwig wrote:
>>> fallocate with the whence argument and flags is already quite complicated,
>>> I'd rather have another call for placement decisions, that would
>>> be called on an fd to do placement decisions for any further allocations
>>> (prealloc, write, etc)
>> Yes, posix_fallocate shouldn't be made more complicated.  But I don't
>> understand why requesting linear layout of the blocks should be an
>> option.  It's always an advantage if the blocks requested this way are
>> linear on disk.  So, the kernel should always do its best to make this
>> happen, without needing an additional option.
>   Actually, it's not that simple. You want linear layout of blocks you are
> going to read. That is not necessarily a linear layout of blocks in a single
> file - trace sometime a start of some complicated app like KDE. You find
> it's seeking like a hell because it needs a few blocks from a ton of
> distinct files (shared libs, config files, etc). As these files are mostly
> read only, it's advantageous to interleave them on disk or at least keep
> them close.

At some point shouldn't the apps be fixed, rather than do crazy things
with the filesystem?  :)

-Eric


Re: [RFC] Heads up on sys_fallocate()

2007-03-06 Thread Eric Sandeen
Ulrich Drepper wrote:
> Christoph Hellwig wrote:
>> fallocate with the whence argument and flags is already quite complicated,
>> I'd rather have another call for placement decisions, that would
>> be called on an fd to do placement decisions for any further allocations
>> (prealloc, write, etc)
> 
> Yes, posix_fallocate shouldn't be made more complicated.  But I don't
> understand why requesting linear layout of the blocks should be an
> option.  It's always an advantage if the blocks requested this way are
> linear on disk.  So, the kernel should always do its best to make this
> happen, without needing an additional option.
> 

Agreed on both points.  The hints would be for things like start block,
or speculative EOF preallocation, not contiguity, which I think should
always be the goal.

-Eric


Re: [RFC] Heads up on sys_fallocate()

2007-03-06 Thread Christoph Hellwig
On Tue, Mar 06, 2007 at 06:36:09AM -0800, Ulrich Drepper wrote:
> Christoph Hellwig wrote:
> > fallocate with the whence argument and flags is already quite complicated,
> > I'd rather have another call for placement decisions, that would
> > be called on an fd to do placement decisions for any further allocations
> > (prealloc, write, etc)
> 
> Yes, posix_fallocate shouldn't be made more complicated.  But I don't
> understand why requesting linear layout of the blocks should be an
> option.  It's always an advantage if the blocks requested this way are
> linear on disk.  So, the kernel should always do its best to make this
> happen, without needing an additional option.

There are HPC workloads where you have multiple writers on multiple machines
that write to different parts of a file.  You preferably want each
of those regions in separate allocation groups.  (Or tell the customers
to use separate files for the regions..)


Re: [RFC] Heads up on sys_fallocate()

2007-03-06 Thread Jan Kara
On Tue 06-03-07 06:36:09, Ulrich Drepper wrote:
> Christoph Hellwig wrote:
> > fallocate with the whence argument and flags is already quite complicated,
> > I'd rather have another call for placement decisions, that would
> > be called on an fd to do placement decisions for any further allocations
> > (prealloc, write, etc)
> 
> Yes, posix_fallocate shouldn't be made more complicated.  But I don't
> understand why requesting linear layout of the blocks should be an
> option.  It's always an advantage if the blocks requested this way are
> linear on disk.  So, the kernel should always do its best to make this
> happen, without needing an additional option.
  Actually, it's not that simple. You want linear layout of blocks you are
going to read. That is not necessarily a linear layout of blocks in a single
file - trace sometime a start of some complicated app like KDE. You find
it's seeking like a hell because it needs a few blocks from a ton of
distinct files (shared libs, config files, etc). As these files are mostly
read only, it's advantageous to interleave them on disk or at least keep
them close.

Honza
-- 
Jan Kara <[EMAIL PROTECTED]>
SuSE CR Labs


Re: [RFC] Heads up on sys_fallocate()

2007-03-06 Thread Ulrich Drepper
Christoph Hellwig wrote:
> fallocate with the whence argument and flags is already quite complicated,
> I'd rather have another call for placement decisions, that would
> be called on an fd to do placement decisions for any further allocations
> (prealloc, write, etc)

Yes, posix_fallocate shouldn't be made more complicated.  But I don't
understand why requesting linear layout of the blocks should be an
option.  It's always an advantage if the blocks requested this way are
linear on disk.  So, the kernel should always do its best to make this
happen, without needing an additional option.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖





Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Christoph Hellwig
On Mon, Mar 05, 2007 at 12:02:59PM -0800, Mingming Cao wrote:
> Yep, I think it makes sense to use preallocation for defragmentation.
> After all, both preallocation and defragmentation call the underlying
> filesystem's multiple block allocator to try to allocate a chunk of
> contiguous blocks on disk. The ext4 online defrag implementation by Takashi
> already supports choosing a "goal" allocation block to guide the ext4
> block allocator to place the defragmented file in a specific location.
> 
> Passing a few more hints to sys_fallocate() (i.e., a goal block,
> and/or whether the goal block matters more than the size of the prealloc
> extent) might make it more useful for the original goal (getting contiguous
> and guaranteed blocks) and for defragmentation.

fallocate with the whence argument and flags is already quite complicated,
I'd rather have another call for placement decisions, that would
be called on an fd to do placement decisions for any further allocations
(prealloc, write, etc)


Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Eric Sandeen
Jörn Engel wrote:
> Does the allocation have to be persistent beyond lifetime of the file
> descriptor?  It would be fairly simple to support the write guarantee
> while the file is open (or rather the inode remains cached) and drop it
> afterwards.

"The posix_fallocate() function shall ensure that any required storage
for regular file data starting at offset and continuing for len bytes is
allocated on the file system storage media."

I interpret "on the storage media" to mean that it is persistent.

-Eric
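
To make the quoted requirement concrete, here is a minimal sketch of how an
application would typically use posix_fallocate() once, at file creation. The
helper name create_preallocated() is hypothetical. Note that
posix_fallocate() returns an error number directly and does not set errno.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create (or truncate) `path` and reserve storage for its first `len` bytes.
 * Returns 0 on success or an errno-style error number, e.g. ENOSPC.
 */
int create_preallocated(const char *path, off_t len)
{
    int fd, err;

    assert(len > 0);
    fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0600);
    if (fd < 0)
        return errno;
    /* On success, the file size is extended to `len` if it was smaller. */
    err = posix_fallocate(fd, 0, len);
    if (close(fd) != 0 && err == 0)
        err = errno;
    return err;
}
```

If the reservation is persistent as interpreted above, later writes into that
range by any process should not fail for lack of space, even after this
descriptor has been closed.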


Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Eric Sandeen
Jan Kara wrote:

>> I am wondering if it is useful to add another mode to advise block 
>> allocation policy? Something like indicating which physical block/block 
>> group to allocate from (goal), and whether to ask for strict contiguous 
>> blocks. This will help preallocation or reservation to choose the right 
>> blocks for the file.
>   Yes, I also think this would be useful so you can "guide"
> preallocation for things like defragmentation (e.g. preallocate space
> for the file being defragmented and move the file to it).

Hints & policies for allocation would certainly be useful, but I think
they belong outside this interface.  i.e. you could flag an inode for
whatever allocation you choose, and -then- call posix_fallocate so that
the allocator will take the hints you've given it.

See also this blurb from the posix_fallocate definition:

"It is implementation-defined whether a previous posix_fadvise() call
influences allocation strategy."

FWIW I don't see a lot of point in asking for "strict contiguous blocks"
- the allocator will presumably try to do this in any case, and I'm not
sure when you would want to fail if you get more than one extent...?

-Eric
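
A rough sketch of the "hint first, then allocate" ordering described in this
message, using only calls that exist today. The helper name
hint_then_preallocate() and the choice of POSIX_FADV_SEQUENTIAL are
assumptions for illustration; as the quoted text says, whether the advice
influences allocation at all is implementation-defined.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Give the kernel an access-pattern hint for a range, then reserve storage
 * for that range. Returns 0 or an errno-style error number from
 * posix_fallocate(); a failed hint is ignored because it is only advice.
 */
int hint_then_preallocate(int fd, off_t offset, off_t len)
{
    assert(offset >= 0 && len > 0);
    (void)posix_fadvise(fd, offset, len, POSIX_FADV_SEQUENTIAL);
    return posix_fallocate(fd, offset, len);
}
```

The point of the ordering is that the allocation call itself stays simple,
and any policy travels through a separate interface.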


Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Mingming Cao

Jan Kara wrote:
>>> On Fri, 02 Mar 2007 09:40:54 +1100
>>> Nathan Scott <[EMAIL PROTECTED]> wrote:
>>>
>>>> On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote:
>>>>
>>>>> On Fri, 2 Mar 2007 00:04:45 +0530
>>>>> "Amit K. Arora" <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> This is to give a heads up on a few patches that we will soon be coming up
>>>>>> with. These patches implement a new system call sys_fallocate() and a
>>>>>> new inode operation "fallocate", for persistent preallocation. The new
>>>>>> system call, as Andrew suggested, will look like:
>>>>>>
>>>>>> asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);
>>>>>
>>>>> ...
>>>>>
>>>>> I'd agree with Eric on the "command" flag extension.
>>>>
>>>> Seems like a separate syscall would be better, "command" sounds
>>>> a bit ioctl like, especially if that command is passed into the
>>>> filesystems..
>>>
>>> madvise, fadvise, lseek, etc seem to work OK.
>>>
>>> I get repeatedly traumatised by patch rejects whenever a new syscall gets
>>> added, so I'm biased.
>>>
>>> The advantage of a command flag is that we can add new modes in the future
>>> without causing lots of churn, waiting for arch maintainers to catch up,
>>> potentially adding new compat code, etc.
>>>
>>> Rename it to "mode"? ;)
>>
>> I am wondering if it is useful to add another mode to advise block
>> allocation policy? Something like indicating which physical block/block
>> group to allocate from (goal), and whether to ask for strict contiguous
>> blocks. This will help preallocation or reservation to choose the right
>> blocks for the file.
>
>   Yes, I also think this would be useful so you can "guide"
> preallocation for things like defragmentation (e.g. preallocate space
> for the file being defragmented and move the file to it).
>
> Honza

Yep, I think it makes sense to use preallocation for defragmentation.
After all, both preallocation and defragmentation call the underlying
filesystem's multiple block allocator to try to allocate a chunk of
contiguous blocks on disk. The ext4 online defrag implementation by Takashi
already supports choosing a "goal" allocation block to guide the ext4
block allocator to place the defragmented file in a specific location.


Passing a few more hints to sys_fallocate() (i.e., a goal block,
and/or whether the goal block matters more than the size of the prealloc
extent) might make it more useful for the original goal (getting contiguous
and guaranteed blocks) and for defragmentation.


Regards,
Mingming
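
For reference while reading the "command" versus "mode" discussion quoted
above: the fallocate() wrapper that current Linux and glibc provide takes a
mode argument ahead of offset and len, which differs from the three-argument
prototype proposed in this thread. The sketch below uses that present-day
interface with FALLOC_FL_KEEP_SIZE; the helper name reserve_keep_size() is
hypothetical, and no goal-block hint of the kind suggested here is involved.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Reserve blocks for [offset, offset + len) without changing the file size.
 * Returns 0 on success or an errno value, e.g. EOPNOTSUPP when the
 * filesystem does not implement this mode.
 */
int reserve_keep_size(int fd, off_t offset, off_t len)
{
    assert(offset >= 0 && len > 0);
    if (fallocate(fd, FALLOC_FL_KEEP_SIZE, offset, len) != 0)
        return errno;
    return 0;
}
```

Unlike posix_fallocate(), the raw call does not fall back to writing zeros,
so callers have to handle EOPNOTSUPP themselves.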


Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Ulrich Drepper
Theodore Tso wrote:
> [...] although the libc
> implementation still wouldn't be able to go away for a long time due to
> the need to be backwards compatible with older kernels that didn't
> have this support.

It's better than that.  If somebody compiles glibc to not run on older
kernels at all (tested at runtime) then the code is dropped.  E.g., the
current Fedora glibc does not support 2.6.8 or earlier.

So, don't let the compat code be a factor in the decision making.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖





Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Theodore Tso
On Mon, Mar 05, 2007 at 07:15:33AM -0800, Ulrich Drepper wrote:
> Well, I'm sure the kernel can do better than the code we have in libc
> now.  The kernel has access to the bitmasks which say which blocks have
> already been allocated.  The libc code does not and we have to be very
> simple-minded and simply touch every block.  And this means reading it
> and then writing it back.  The kernel would know when the reading part
> is not necessary.  Add to that the block granularity (we use f_bsize as
> returned from fstatfs but that's not the best value in some cases) and
> you have compelling data to have generic code in the kernel.  The libc
> implementation can then go away completely which is a good thing.

You have a very good point; indeed since we don't export an interface
which allows userspace to determine whether or not a block is in use,
that does mean a huge amount of churn in the page cache.  So maybe it
would be worth doing in the kernel as a result, although the libc
implementation still wouldn't be able to go away for a long time due to
the need to be backwards compatible with older kernels that didn't
have this support.

Regards,

- Ted
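
The userspace fallback described in the quoted text (touch every block by
reading it and writing it back) can be sketched roughly as follows. This is
an illustration of the technique, not glibc's code: the name
emulate_fallocate() is hypothetical, and it takes the block size from
fstatvfs() rather than the fstatfs() call mentioned above. It is not atomic
with respect to concurrent writers and does not guard against offset + len
overflowing.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>          /* open(), for callers of this helper */
#include <sys/statvfs.h>
#include <sys/types.h>
#include <unistd.h>

/* Read one byte at `pos` and write it back; a hole or EOF yields a zero. */
static int touch_byte(int fd, off_t pos)
{
    char c = 0;
    ssize_t n = pread(fd, &c, 1, pos);

    if (n < 0)
        return errno;
    n = pwrite(fd, &c, 1, pos);
    if (n != 1)
        return n < 0 ? errno : EIO;
    return 0;
}

/* Force allocation of [offset, offset + len) by touching each block once. */
int emulate_fallocate(int fd, off_t offset, off_t len)
{
    struct statvfs vfs;
    off_t step, end, pos;
    int err;

    if (offset < 0 || len <= 0)
        return EINVAL;
    if (fstatvfs(fd, &vfs) != 0)
        return errno;
    step = vfs.f_bsize > 0 ? (off_t)vfs.f_bsize : 512;
    assert(step > 0);
    end = offset + len;
    for (pos = offset; pos < end; pos += step) {
        err = touch_byte(fd, pos);
        if (err != 0)
            return err;
    }
    /* Touch the final byte so the last block and the file size are covered. */
    return touch_byte(fd, end - 1);
}
```

Every touched block passes through the page cache whether or not it was
already allocated, which is the churn this message refers to.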


Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Ulrich Drepper
Jörn Engel wrote:
>> Of course.  You call posix_fallocate once for the lifetime of the file
>> when it is created to ensure that all future uses will work.
> 
> That part is not quite clear from the manpage but I trust most people
> would assume the same.

Not only that, it is what this function is for.  In the POSIX committee
we've looked at the functions in detail before adding them, even if some
information is not in the man page but instead in the Rationale.


> Still, it is quite obvious that no one designing this interface has given
> much thought to compressing filesystems.

You already have problems with supporting the functionality
posix_fallocate is supporting.  You cannot reliably support MAP_SHARED
files if all of a sudden the compression causes an expansion of a block
and that causes an ENOSPC error.  So, don't expect pity.  This is a
function in support of a real and reliable implementation of memory
mapped files.  You don't use MAP_SHARED on such filesystems, it'll eat
your kittens sooner or later anyway.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
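
A minimal sketch of the use case named in this message: reserve the storage
first, then map the file MAP_SHARED, so that stores through the mapping do
not depend on blocks being allocated later. The helper name
map_preallocated() is hypothetical, and the sketch assumes a filesystem on
which the reservation actually holds.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create `path` if needed, reserve `len` bytes of storage, and return a
 * shared read/write mapping of that range, or NULL on any failure.
 */
void *map_preallocated(const char *path, size_t len)
{
    void *p;
    int fd;

    assert(len > 0);
    fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return NULL;
    if (posix_fallocate(fd, 0, (off_t)len) != 0) {
        close(fd);
        return NULL;    /* e.g. ENOSPC: better to fail here than at store time */
    }
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);          /* the mapping remains valid after close() */
    return p == MAP_FAILED ? NULL : p;
}
```

Without the reservation, running out of space would only show up when dirty
pages are written back, where the program has no clean way to handle it.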





Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Jörn Engel
On Mon, 5 March 2007 07:08:03 -0800, Ulrich Drepper wrote:
> Jörn Engel wrote:
> > Does the allocation have to be persistent beyond lifetime of the file
> > descriptor?
> 
> Of course.  You call posix_fallocate once for the lifetime of the file
> when it is created to ensure that all future uses will work.

That part is not quite clear from the manpage but I trust most people
would assume the same.

> It seems your filesystem will not be able to support this unless
> compression is turned off.

Correct.  Compression needs to be turned off for a file, if
posix_fallocate(3) is to succeed.  What I could do is disable
compression (meaning that no data written in the future will be
compressed) and rewrite all blocks within the given range.

Still, it is quite obvious that no one designing this interface has given
much thought to compressing filesystems.  Whatever I can come up with
will either be incompatible or some sort of hack.  :(

Jörn

-- 
Courage is not the absence of fear, but rather the judgement that
something else is more important than fear.
-- Ambrose Redmoon


Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Christoph Hellwig
On Mon, Mar 05, 2007 at 07:15:33AM -0800, Ulrich Drepper wrote:
> Theodore Tso wrote:
> > Given that glibc already has to support this for older kernels, I
> > would argue that there's no point putting in generic support for
> > filesystem that can't support a more advanced way of doing things.
> 
> Well, I'm sure the kernel can do better than the code we have in libc
> now.  The kernel has access to the bitmasks which say which blocks have
> already been allocated.

The layer of the kernel where a totally generic fallback would be
implemented does not have access to this information.  We could do
a mostly generic helper for block filesystems that allows to implement
fallocate this way without a lot of their own code.
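
To make the shape of such a helper concrete, here is a minimal userspace
sketch in C.  Nothing below is kernel code: toy_fs, toy_get_block and
generic_prealloc are hypothetical names, and the callback is a heavily
simplified stand-in for a real get_block routine.  It only illustrates
the split: a generic loop walks the block range, and a callback supplied
by the filesystem decides whether a block is already allocated.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_BLOCKS 16

/* Toy filesystem state: map[i] != 0 means logical block i is allocated. */
struct toy_fs {
    unsigned char map[TOY_BLOCKS];
    size_t free_blocks;
};

/* Simplified stand-in for a get_block-style callback (an assumption,
 * not the kernel's signature): make sure the logical block is mapped,
 * allocating it first if create is set. */
typedef int (*get_block_fn)(void *fs, size_t lblk, int create);

int toy_get_block(void *ctx, size_t lblk, int create)
{
    struct toy_fs *fs = ctx;

    if (lblk >= TOY_BLOCKS)
        return -EFBIG;
    if (fs->map[lblk])
        return 0;               /* already allocated, nothing to do */
    if (!create)
        return -ENOENT;
    if (fs->free_blocks == 0)
        return -ENOSPC;
    fs->map[lblk] = 1;
    fs->free_blocks--;
    return 0;
}

/* The mostly generic part: walk the range and let the callback
 * allocate whatever is missing.  Returns 0 or a negative errno.
 * Blocks allocated before a failure are not rolled back here. */
int generic_prealloc(void *fs, get_block_fn get_block,
                     size_t first, size_t count)
{
    assert(get_block != NULL);
    for (size_t i = 0; i < count; i++) {
        int err = get_block(fs, first + i, 1);
        if (err)
            return err;
    }
    return 0;
}
```

Only the callback knows the allocation state, which is why a helper like
this could serve block-based filesystems but not be totally generic.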


Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Ulrich Drepper
Jörn Engel wrote:
> The bad news for posix_fallocate() is that even if libc is smart enough
> to write random data, mmap() can still cause problems.

This is not smart, quite to the contrary.  The standard guarantees that
all not-yet-written-to places in the file are zero.  And if a block has
already been written posix_fallocate cannot change it.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖





Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Ulrich Drepper
Jörn Engel wrote:
> Does the allocation have to be persistent beyond lifetime of the file
> descriptor?

Of course.  You call posix_fallocate once for the lifetime of the file
when it is created to ensure that all future uses will work.

It seems your filesystem will not be able to support this unless
compression is turned off.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖





Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Ulrich Drepper
Theodore Tso wrote:
> Given that glibc already has to support this for older kernels, I
> would argue that there's no point putting in generic support for
> filesystem that can't support a more advanced way of doing things.

Well, I'm sure the kernel can do better than the code we have in libc
now.  The kernel has access to the bitmasks which say which blocks have
already been allocated.  The libc code does not and we have to be very
simple-minded and simply touch every block.  And this means reading it
and then writing it back.  The kernel would know when the reading part
is not necessary.  Add to that the block granularity (we use f_bsize as
returned from fstatfs but that's not the best value in some cases) and
you have compelling data to have generic code in the kernel.  The libc
implementation can then go away completely which is a good thing.
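
For readers who have not seen the userspace approach, the following is a
rough sketch of what "touch every block" amounts to.  It is not glibc's
actual code: emulate_fallocate is a hypothetical name, error handling is
minimal, and it assumes a Linux system where fstatfs() is available from
<sys/vfs.h>.  It also ignores concurrent writers entirely.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/vfs.h>
#include <unistd.h>

/* Hypothetical emulation, not glibc's real implementation.  Returns 0
 * or an errno value.  Not safe against concurrent writers. */
int emulate_fallocate(int fd, off_t offset, off_t len)
{
    struct stat st;
    struct statfs sfs;

    assert(offset >= 0 && len > 0);
    if (fstat(fd, &st) != 0 || fstatfs(fd, &sfs) != 0)
        return errno;

    /* f_bsize is the only granularity userspace can ask for. */
    off_t step = sfs.f_bsize > 0 ? (off_t)sfs.f_bsize : 512;
    off_t end = offset + len;

    /* Touch one byte at the start of the range and at every following
     * block boundary. */
    for (off_t pos = offset; pos < end; pos = (pos / step + 1) * step) {
        char byte = 0;

        /* Inside the old file size the block may hold data or be a
         * hole; userspace cannot tell, so read the byte and write the
         * same value back. */
        if (pos < st.st_size && pread(fd, &byte, 1, pos) < 0)
            return errno;
        if (pwrite(fd, &byte, 1, pos) != 1)
            return errno ? errno : EIO;
    }

    /* Grow the file to offset+len if it was shorter. */
    if (st.st_size < end) {
        char zero = 0;
        if (pwrite(fd, &zero, 1, end - 1) != 1)
            return errno ? errno : EIO;
    }
    return 0;
}
```

Every block in the range costs a write and possibly a read, whether or
not it was already allocated, which is the inefficiency described above.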

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖





Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Anton Altaparmakov

On 5 Mar 2007, at 14:37, Theodore Tso wrote:

> On Sun, Mar 04, 2007 at 11:22:06PM +, Anton Altaparmakov wrote:
>> And I specifically did NOT update the initialized size in the inode
>> thus it will remain at its old value thus all new allocated blocks
>> will be considered as present but not initialized thus a read will
>> always return zero whilst a write will do the right thing and pad
>> with zeroes as necessary (if the write is smaller than the block
>> size, etc).
>
> You're describing a method of doing in-advance preallocation
> where the filesystem format explicitly has support for this kind of
> feature in a way that doesn't require pre-zeroing the data blocks in
> question.

Indeed.

> The question which this subthread was concerned about was
> whether the kernel should get involved in initializing datablocks in
> the case where the filesystem format does not have this support, or
> whether this functionality should continue to be done in userspace.
> Given that glibc already has to support this for older kernels, I
> would argue that there's no point putting in generic support for
> filesystem that can't support a more advanced way of doing things.


Yes, I understood that after I had sent my post...  And yes, I would  
agree.  If glibc already does this there does not appear to be any  
value in just moving existing functionality into the kernel.  Simply  
let "dumb" file systems return ENOSYS and let glibc do it...  And any  
FS which can do it better can implement the function and then glibc  
should not go anywhere near it.
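
As an illustration of that split, here is a small sketch written against
the fallocate() wrapper found in current Linux/glibc, which postdates
this discussion, so treat the interface as an assumption relative to the
thread.  preallocate is a hypothetical helper name.  Note that current
kernels report EOPNOTSUPP when the filesystem lacks support and ENOSYS
when the syscall itself is missing, so the sketch checks both.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/* Hypothetical helper: returns 0 or an errno value. */
int preallocate(int fd, off_t offset, off_t len)
{
    assert(offset >= 0 && len > 0);

    /* Fast path: ask the kernel, and through it the filesystem. */
    if (fallocate(fd, 0, offset, len) == 0)
        return 0;

    /* ENOSYS: the kernel has no such syscall.  EOPNOTSUPP: this
     * filesystem does not implement it.  Anything else is a real
     * failure and is passed back unchanged. */
    if (errno != ENOSYS && errno != EOPNOTSUPP)
        return errno;

    /* Slow path: let libc do it by touching the blocks itself.
     * posix_fallocate returns the error number directly. */
    return posix_fallocate(fd, offset, len);
}
```

In practice glibc's posix_fallocate() already performs this kind of
dispatch internally; the explicit version is only meant to show where
the "dumb" filesystem case ends up.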


Best regards,

Anton
--
Anton Altaparmakov  (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/




Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Theodore Tso
On Sun, Mar 04, 2007 at 11:22:06PM +, Anton Altaparmakov wrote:
> And I specifically did NOT update the initialized size in the inode  
> thus it will remain at its old value thus all new allocated blocks  
> will be considered as present but not initialized thus a read will  
> always return zero whilst a write will do the right thing and pad  
> with zeroes as necessary (if the write is smaller than the block  
> size, etc).

Anton,

You're describing a method of doing in-advance preallocation
where the filesystem format explicitly has support for this kind of
feature in a way that doesn't require pre-zeroing the data blocks in
question.
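
A toy in-memory model may help show why no pre-zeroing is needed when
the format tracks an initialized size separately from the allocated
size.  This is not NTFS code; struct toy_file and the toy_* functions
are hypothetical names, and the model leaves out blocks, caching and
growing the allocation on write.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_CAPACITY 4096

/* Hypothetical model of a file that tracks two sizes. */
struct toy_file {
    unsigned char disk[TOY_CAPACITY]; /* may hold stale garbage */
    size_t allocated;    /* bytes of space reserved on "disk" */
    size_t initialized;  /* bytes that hold valid data */
};

/* Preallocate: reserve space but leave initialized untouched. */
int toy_prealloc(struct toy_file *f, size_t size)
{
    assert(f != NULL);
    if (size > TOY_CAPACITY)
        return -1;
    if (size > f->allocated)
        f->allocated = size;
    return 0;
}

/* Read: bytes past initialized read as zero, whatever is on disk. */
size_t toy_read(const struct toy_file *f, size_t off,
                unsigned char *buf, size_t n)
{
    if (off >= f->allocated)
        return 0;
    if (n > f->allocated - off)
        n = f->allocated - off;
    for (size_t i = 0; i < n; i++)
        buf[i] = (off + i < f->initialized) ? f->disk[off + i] : 0;
    return n;
}

/* Write: zero the gap between initialized and off, then store data. */
int toy_write(struct toy_file *f, size_t off,
              const unsigned char *buf, size_t n)
{
    if (off > f->allocated || n > f->allocated - off)
        return -1;      /* toy model: writes never allocate */
    if (off > f->initialized)
        memset(f->disk + f->initialized, 0, off - f->initialized);
    memcpy(f->disk + off, buf, n);
    if (off + n > f->initialized)
        f->initialized = off + n;
    return 0;
}
```

Preallocation here is a metadata update only; the zeroing cost is paid
lazily, and only for the gap a later write actually skips over.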

The question which this subthread was concerned about was
whether the kernel should get involved in initializing datablocks in
the case where the filesystem format does not have this support, or
whether this functionality should continue to be done in userspace.
Given that glibc already has to support this for older kernels, I
would argue that there's no point putting in generic support for
filesystem that can't support a more advanced way of doing things.

Regards,

- Ted


Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Jörn Engel
On Mon, 5 March 2007 00:32:14 +, Anton Altaparmakov wrote:
> 
> I don't know how your compression algorithm works [...]

LogFS is designed for flash media, so it does not have to worry much
about reducing disk seeks.  It is log-structured, which simplifies
compression further.

When writing a block, it basically compresses it and appends it to the
log.  Writes only have to be byte-aligned, so no space is lost for
padding.

The bad news for posix_fallocate() is that even if libc is smart enough
to write random data, mmap() can still cause problems.  If the VM
decides to write a given page twice, the second write compresses better
and the medium has filled up between the two writes, the users will have
fun.
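
The underlying difficulty is that the space a page needs in the log
depends on its contents at the time of each write.  The sketch below
uses a naive run-length encoding purely as a stand-in; the thread does
not say which compressor LogFS uses, and rle_size is a hypothetical
helper.

```c
#include <assert.h>
#include <stddef.h>

/* Size in bytes of a naive run-length encoding of buf: one
 * (count, value) pair per run, with runs capped at 255 bytes.
 * A stand-in for a real compressor, for illustration only. */
size_t rle_size(const unsigned char *buf, size_t n)
{
    size_t out = 0;
    size_t i = 0;

    assert(buf != NULL || n == 0);
    while (i < n) {
        size_t run = 1;

        /* Extend the run while the next byte repeats the current one. */
        while (i + run < n && run < 255 && buf[i + run] == buf[i])
            run++;
        out += 2;       /* one count byte plus one value byte */
        i += run;
    }
    return out;
}
```

With this encoding a 4096-byte page of zeroes takes 34 bytes, while a
page of alternating 0 and 1 bytes takes 8192.  Space consumed by one
write of a page therefore says nothing about what a later write of the
same page will need.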

Jörn

-- 
Joern's library part 9:
http://www.scl.ameslab.gov/Publications/Gus/TwelveWays.html


Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Christoph Hellwig
On Sat, Mar 03, 2007 at 11:45:32PM +0100, Arnd Bergmann wrote:
> > I'd be more happy to have the write out zeroes loop in glibc.  And
> > glibc needs to have it anyway, for older kernels.
> 
> A generic_fallocate makes sense to me iff we can do it in the kernel
> significantly more efficiently than in glibc, e.g. by using only
> a single page in page cache instead of one for each page to be preallocated.

We can't do that with the current page cache interfaces.  But what
might make sense is to have a block_dump_prealloc that takes a get_block
callback to do what you propose.  It still wouldn't be entirely generic,
but would allow block based filesystems to do a not entirely dumb
implementation.


Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Jörn Engel
On Mon, 5 March 2007 01:36:36 +0100, Arnd Bergmann wrote:
> 
> Using the current glibc implementation on a compressed file system ideally
> should be a very expensive no-op because you won't actually allocate much
> space for a file when writing zeroes to it. You also don't benefit from a
> contiguous allocation in logfs, since flash has uniform seek times over
> all the medium.
> 
> I'd suggest you implement posix_fallocate as a real nop and just return
> success without doing anything. You could also return ENOSPC in case
> the blocks requested by posix_fallocate don't fit on the medium without
> compression, but that is more or less just guesswork (like statfs is).

Quoting POSIX_FALLOCATE(3):
   The function posix_fallocate() ensures that disk space is allocated for
   the file referred to by the descriptor fd for the bytes  in  the range
   starting  at  offset  and continuing for len bytes.  After a successful
   call to posix_fallocate(), subsequent writes to bytes in the specified
   range are guaranteed not to fail because of lack of disk space.

   If  the  size  of  the  file  is less than offset+len, then the file is
   increased to this size; otherwise the file size is left unchanged.

Afaics, the (main) purpose of this function is not to decrease
fragmentation but to ensure mmap() won't cause any problems because the
medium fills up.  That problem exists for LogFS as well, once rw mmap()
is supported.
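
The usage pattern in question looks roughly like the sketch below:
reserve the space first, then map the file shared and writable, so that
dirtying the mapping later cannot run into a full medium.
map_preallocated is a hypothetical helper name, and the sketch assumes
the filesystem honours the posix_fallocate() guarantee quoted above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>

/* Hypothetical helper: reserve len bytes for fd starting at offset 0,
 * then map them shared and writable.  Returns the mapping, or NULL on
 * failure. */
void *map_preallocated(int fd, size_t len)
{
    assert(len > 0);

    /* posix_fallocate returns an error number directly, not via errno. */
    if (posix_fallocate(fd, 0, (off_t)len) != 0)
        return NULL;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

Without the reservation step, a store into the mapping can succeed in
memory and only fail at writeback time, when there is no good way left
to report ENOSPC to the application.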

Simply returning success without doing anything would be a bug.  -ENOSPC
is a better choice, but still a lame implementation.  And falling back
on libc to write zeroes in a loop is an exercise in futility.

Does the allocation have to be persistent beyond lifetime of the file
descriptor?  It would be fairly simple to support the write guarantee
while the file is open (or rather the inode remains cached) and drop it
afterwards.

Jörn

-- 
"[One] doesn't need to know [...] how to cause a headache in order
to take an aspirin."
-- Scott Culp, Manager of the Microsoft Security Response Center, 2001


Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Jan Kara
> >On Fri, 02 Mar 2007 09:40:54 +1100
> >Nathan Scott <[EMAIL PROTECTED]> wrote:
> >
> >
> >>On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote:
> >>
> >>>On Fri, 2 Mar 2007 00:04:45 +0530
> >>>"Amit K. Arora" <[EMAIL PROTECTED]> wrote:
> >>>
> >>>
> >>>>This is to give a heads up on few patches that we will be soon coming up
> >>>>with. These patches implement a new system call sys_fallocate() and a
> >>>>new inode operation "fallocate", for persistent preallocation. The new
> >>>>system call, as Andrew suggested, will look like:
> >>>>
> >>>> asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);
> >>>
> >>>...
> >>>
> >>>I'd agree with Eric on the "command" flag extension.
> >>
> >>Seems like a separate syscall would be better, "command" sounds
> >>a bit ioctl like, especially if that command is passed into the
> >>filesystems..
> >>
> >
> >
> >madvise, fadvise, lseek, etc seem to work OK.
> >
> >I get repeatedly traumatised by patch rejects whenever a new syscall gets
> >added, so I'm biased.
> >
> >The advantage of a command flag is that we can add new modes in the future
> >without causing lots of churn, waiting for arch maintainers to catch up,
> >potentially adding new compat code, etc.
> >
> >Rename it to "mode"? ;)
> >
> I am wondering if it is useful to add another mode to advise block
> allocation policy? Something like indicating which physical block/block
> group to allocate from (goal), and whether to ask for strictly contiguous
> blocks. This will help preallocation or reservation to choose the right
> blocks for the file.
  Yes, I also think this would be useful so you can "guide"
preallocation for things like defragmentation (e.g. preallocate space
for the file being defragmented and move the file to it).

Honza
-- 
Jan Kara <[EMAIL PROTECTED]>
SuSE CR Labs


Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Theodore Tso
On Mon, Mar 05, 2007 at 07:15:33AM -0800, Ulrich Drepper wrote:
> Well, I'm sure the kernel can do better than the code we have in libc
> now.  The kernel has access to the bitmasks which say which blocks have
> already been allocated.  The libc code does not and we have to be very
> simple-minded and simply touch every block.  And this means reading it
> and then writing it back.  The kernel would know when the reading part
> is not necessary.  Add to that the block granularity (we use f_bsize as
> returned from fstatfs but that's not the best value in some cases) and
> you have compelling data to have generic code in the kernel.  The libc
> implementation can then go away completely which is a good thing.

You have a very good point; indeed since we don't export an interface
which allows userspace to determine whether or not a block is in use,
that does mean a huge amount of churn in the page cache.  So maybe it
would be worth doing in the kernel as a result, although the libc
implementation still wouldn't be able to go away for a long time due to
the need to be backwards compatible with older kernels that didn't
have this support.

Regards,

- Ted


Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Ulrich Drepper
Theodore Tso wrote:
> [...] although the libc
> implementation still wouldn't be able to go away for a long time due to
> the need to be backwards compatible with older kernels that didn't
> have this support.

It's better than that.  If somebody compiles glibc to not run on older
kernels at all (tested at runtime) then the code is dropped.  E.g., the
current Fedora glibc does not support 2.6.8 or earlier.

So, don't let the compat code be a factor in the decision making.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
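
A rough sketch of the point about compat code, under loud assumptions: ASSUME_KERNEL_FALLOCATE is a hypothetical build-time switch (not a real glibc macro), and the two stubs stand in for the kernel call and the userspace emulation. When the library is built for kernels that are known to have the call, the fallback branch is compiled out entirely.

```c
#include <errno.h>
#include <sys/types.h>

/* Hypothetical switch: set to 1 when the oldest supported kernel
 * is known to provide the system call. */
#ifndef ASSUME_KERNEL_FALLOCATE
#define ASSUME_KERNEL_FALLOCATE 0
#endif

typedef int (*alloc_fn)(int fd, off_t offset, off_t len);

/* Try the kernel first; fall back to emulation only if the build
 * still has to support kernels without the call. */
int do_fallocate(alloc_fn kernel_call, alloc_fn emulation,
                 int fd, off_t offset, off_t len)
{
    int err = kernel_call(fd, offset, len);
#if !ASSUME_KERNEL_FALLOCATE
    if (err == ENOSYS)
        err = emulation(fd, offset, len);
#else
    (void)emulation; /* compat path dropped at build time */
#endif
    return err;
}

/* Stand-in stubs so the sketch can be exercised without a real kernel. */
static int stub_emulation_calls;

static int stub_kernel_enosys(int fd, off_t offset, off_t len)
{
    (void)fd; (void)offset; (void)len;
    return ENOSYS; /* pretend to be an old kernel */
}

static int stub_kernel_ok(int fd, off_t offset, off_t len)
{
    (void)fd; (void)offset; (void)len;
    return 0; /* pretend the kernel handled it */
}

static int stub_emulation(int fd, off_t offset, off_t len)
{
    (void)fd; (void)offset; (void)len;
    stub_emulation_calls++;
    return 0;
}
```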





Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Mingming Cao

Jan Kara wrote:

On Fri, 02 Mar 2007 09:40:54 +1100
Nathan Scott [EMAIL PROTECTED] wrote:

On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote:

On Fri, 2 Mar 2007 00:04:45 +0530
Amit K. Arora [EMAIL PROTECTED] wrote:

This is to give a heads up on few patches that we will be soon coming up
with. These patches implement a new system call sys_fallocate() and a
new inode operation fallocate, for persistent preallocation. The new
system call, as Andrew suggested, will look like:

asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);

...

I'd agree with Eric on the command flag extension.

Seems like a separate syscall would be better, command sounds
a bit ioctl like, especially if that command is passed into the
filesystems..

madvise, fadvise, lseek, etc seem to work OK.

I get repeatedly traumatised by patch rejects whenever a new syscall gets
added, so I'm biased.

The advantage of a command flag is that we can add new modes in the future
without causing lots of churn, waiting for arch maintainers to catch up,
potentially adding new compat code, etc.

Rename it to mode? ;)

I am wondering if it is useful to add another mode to advise block
allocation policy? Something like indicating which physical block/block
group to allocate from (goal), and whether to ask for strictly contiguous
blocks. This will help preallocation or reservation to choose the right
blocks for the file.

  Yes, I also think this would be useful so you can guide
preallocation for things like defragmentation (e.g. preallocate space
for the file being defragmented and move the file to it).

Honza

Yep, I think it makes sense to use preallocation for defragmentation.
After all, both preallocation and defragmentation call the underlying
filesystem's multiple block allocator to try to allocate a chunk of
contiguous blocks on disk. The ext4 online defrag implementation by Takashi
already supports choosing a goal allocation block to guide the ext4
block allocator to place the defragmented file in a specific location.


Passing a little more of a hint to sys_fallocate() (i.e., the goal block,
and/or whether the goal block matters more than the size of the prealloc
extent) might make it more useful both for the original goal (getting
contiguous and guaranteed blocks) and for defragmentation.


Regards,
Mingming


Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Eric Sandeen
Jan Kara wrote:

>> I am wondering if it is useful to add another mode to advise block
>> allocation policy? Something like indicating which physical block/block
>> group to allocate from (goal), and whether to ask for strictly contiguous
>> blocks. This will help preallocation or reservation to choose the right
>> blocks for the file.
>   Yes, I also think this would be useful so you can guide
> preallocation for things like defragmentation (e.g. preallocate space
> for the file being defragmented and move the file to it).

Hints and policies for allocation would certainly be useful, but I think
they belong outside this interface.  i.e. you could flag an inode for
whatever allocation you choose, and -then- call posix_fallocate so that
the allocator will take the hints you've given it.

See also this blurb from the posix_fallocate definition:

"It is implementation-defined whether a previous posix_fadvise() call
influences allocation strategy."

FWIW I don't see a lot of point in asking for strictly contiguous blocks
- the allocator will presumably try to do this in any case, and I'm not
sure when you would want to fail if you get more than one extent...?

-Eric


Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Eric Sandeen
Jörn Engel wrote:
> Does the allocation have to be persistent beyond lifetime of the file
> descriptor?  It would be fairly simple to support the write guarantee
> while the file is open (or rather the inode remains cached) and drop it
> afterwards.

"The posix_fallocate() function shall ensure that any required storage
for regular file data starting at offset and continuing for len bytes is
allocated on the file system storage media."

I interpret "on the storage media" to mean that it is persistent.

-Eric


Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Christoph Hellwig
On Mon, Mar 05, 2007 at 12:02:59PM -0800, Mingming Cao wrote:
> Yep, I think it makes sense to use preallocation for defragmentation.
> After all, both preallocation and defragmentation call the underlying
> filesystem's multiple block allocator to try to allocate a chunk of
> contiguous blocks on disk. The ext4 online defrag implementation by Takashi
> already supports choosing a goal allocation block to guide the ext4
> block allocator to place the defragmented file in a specific location.
>
> Passing a little more of a hint to sys_fallocate() (i.e., the goal block,
> and/or whether the goal block matters more than the size of the prealloc
> extent) might make it more useful both for the original goal (getting
> contiguous and guaranteed blocks) and for defragmentation.

fallocate with the whence argument and flags is already quite complicated.
I'd rather have another call for placement decisions, one that would
be called on an fd to make placement decisions for any further allocations
(prealloc, write, etc.).


Re: [RFC] Heads up on sys_fallocate()

2007-03-04 Thread Christoph Hellwig
On Sun, Mar 04, 2007 at 08:11:17PM +, Anton Altaparmakov wrote:
> glibc cannot ever be smart enough because a file system driver will  
> always know better and be able to do things in a much more optimized  
> way.

Please read the thread again.  That is not what anyone proposed.
The issue we're discussing is whether the fallback for a filesystem that
does not support preallocation natively should be done in kernelspace
or in userspace.



Re: [RFC] Heads up on sys_fallocate()

2007-03-04 Thread Jörn Engel
On Sun, 4 March 2007 14:38:13 -0800, Ulrich Drepper wrote:
> 
> When you do it like this, who can the kernel/filesystem *guarantee* that
> when the data is written there actually is room on the harddrive?
> 
> What you described seems like using truncate/ftruncate to increase the
> file's size.  That is not at all what posix_fallocate is for.
> posix_fallocate must make sure that the requested blocks on the disk are
> reserved (allocated) for the file's use and that at no point in the
> future will, say, a msync() fail because a mmap(MAP_SHARED) page has
> been written to.

That actually causes an interesting problem for compressing filesystems.
The space consumed by blocks depends on their contents and how well it
compresses.  At the moment, the only option I see to support
posix_fallocate for LogFS is to set an inode flag disabling compression,
then allocate the blocks.

But if the file already contains large amounts of compressed data, I
have a problem.  Disabling compression for a range within a file is not
supported, so I can only return an error.  But which one?

Jörn

-- 
A surrounded army must be given a way out.
-- Sun Tzu


Re: [RFC] Heads up on sys_fallocate()

2007-03-04 Thread Arnd Bergmann
On Monday 05 March 2007, Anton Altaparmakov wrote:
> An alternative would be to allocate blocks and then when the data is  
> written perform the compression and free any blocks you do not need  
> any more because the data has shrunk sufficiently.  Depending on the  
> implementation details this could potentially create horrible  
> fragmentation as you would allocate a large consecutive region and  
> then go and drop random blocks from that region thus making the file  
> fragmented.

Unfortunately, this is not as easy on logfs, because there is no point
in allocating a block when there is no data to write into it. Fragmentation
on flash media is free, but you can never modify a block in place without
erasing it first. This means it will always be written to a new location
on the next write access.

One option that might work (similar to what you describe in your other mail)
is to have a per-inode count of reserved blocks, without allocating specific
blocks for them. The journal then needs to maintain the number of total
reserved blocks for all files and keep that in sync with blocks that were
reserved for specific inodes.

Arnd <><
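
A toy model of the per-inode reservation count suggested above, purely as a sketch: nothing here is logfs code, the struct and function names are invented, and journalling is left out. The only idea it shows is that reserving touches counters rather than specific blocks, and that the global total must never exceed the free space.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical in-memory model; invariant: total_reserved <= free_blocks. */
struct volume {
    uint64_t free_blocks;    /* blocks not yet written */
    uint64_t total_reserved; /* sum of all per-inode reservations */
};

struct inode_resv {
    uint64_t reserved;       /* blocks promised to this inode */
};

/* Promise n blocks to an inode without picking any specific block. */
int reserve_blocks(struct volume *v, struct inode_resv *ino, uint64_t n)
{
    if (v->free_blocks - v->total_reserved < n)
        return ENOSPC;
    ino->reserved += n;
    v->total_reserved += n;
    return 0;
}

/* A write needing n new blocks uses the inode's reservation first and
 * may take the rest only from space nobody else has reserved. */
int write_blocks(struct volume *v, struct inode_resv *ino, uint64_t n)
{
    uint64_t from_resv = n < ino->reserved ? n : ino->reserved;
    uint64_t unreserved = n - from_resv;

    if (v->free_blocks - v->total_reserved < unreserved)
        return ENOSPC;
    ino->reserved -= from_resv;
    v->total_reserved -= from_resv;
    v->free_blocks -= n;
    return 0;
}
```

With this accounting a write into a reserved range cannot fail for lack of space, while other files see ENOSPC as soon as only reserved space is left.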


Re: [RFC] Heads up on sys_fallocate()

2007-03-04 Thread Arnd Bergmann
On Monday 05 March 2007, Jörn Engel wrote:
> That actually causes an interesting problem for compressing filesystems.
> The space consumed by blocks depends on their contents and how well it
> compresses.  At the moment, the only option I see to support
> posix_fallocate for LogFS is to set an inode flag disabling compression,
> then allocate the blocks.
> 
> But if the file already contains large amounts of compressed data, I
> have a problem.  Disabling compression for a range within a file is not
> supported, so I can only return an error.  But which one?

Using the current glibc implementation on a compressed file system ideally
should be a very expensive no-op because you won't actually allocate much
space for a file when writing zeroes to it. You also don't benefit from
contiguous allocation in logfs, since flash has uniform seek times over
the whole medium.

I'd suggest you implement posix_fallocate as a real no-op and just return
success without doing anything. You could also return ENOSPC in case
the blocks requested by posix_fallocate don't fit on the medium without
compression, but that is more or less just guesswork (like statfs is).

Arnd <><


Re: [RFC] Heads up on sys_fallocate()

2007-03-04 Thread Anton Altaparmakov

On 5 Mar 2007, at 00:32, Anton Altaparmakov wrote:
> On 5 Mar 2007, at 00:16, Jörn Engel wrote:
>> On Sun, 4 March 2007 14:38:13 -0800, Ulrich Drepper wrote:
>>> When you do it like this, who can the kernel/filesystem *guarantee* that
>>> when the data is written there actually is room on the harddrive?
>>>
>>> What you described seems like using truncate/ftruncate to increase the
>>> file's size.  That is not at all what posix_fallocate is for.
>>> posix_fallocate must make sure that the requested blocks on the disk are
>>> reserved (allocated) for the file's use and that at no point in the
>>> future will, say, a msync() fail because a mmap(MAP_SHARED) page has
>>> been written to.
>>
>> That actually causes an interesting problem for compressing filesystems.
>> The space consumed by blocks depends on their contents and how well it
>> compresses.  At the moment, the only option I see to support
>> posix_fallocate for LogFS is to set an inode flag disabling compression,
>> then allocate the blocks.
>>
>> But if the file already contains large amounts of compressed data, I
>> have a problem.  Disabling compression for a range within a file is not
>> supported, so I can only return an error.  But which one?
>
> I don't know how your compression algorithm works but at least on
> NTFS that bit is easy: you allocate the blocks and mark them as
> allocated then the compression engine will write non-compressed
> data to those blocks.  Basically it works like this "does
> compression block X have any sparse blocks?". If the answer is
> "yes" the block is treated as compressed data and if the answer is
> "no" the block is treated as uncompressed data.  This means that if
> the data cannot be compressed (and in some cases if the data
> compressed is bigger than the data uncompressed) the data is stored
> non-compressed.  That is the most space efficient method to do things.
>
> An alternative would be to allocate blocks and then when the data
> is written perform the compression and free any blocks you do not
> need any more because the data has shrunk sufficiently.  Depending
> on the implementation details this could potentially create
> horrible fragmentation as you would allocate a large consecutive
> region and then go and drop random blocks from that region thus
> making the file fragmented.


And another thing you could do (best if you support journalling)  
would be to do the allocation and hang the details off the inode on a  
"preallocation list" of some kind and then as the data gets written  
use blocks from the preallocation list as you go along.  This would  
avoid the fragmentation issue for example.  You could then free the  
surplus blocks when the whole range of the file being covered by the  
preallocation list has been written to and/or when the file is closed  
for the last time (drop_inode/delete_inode).


Best regards,

Anton
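
The "preallocation list" idea above could be sketched as follows. This is a simplification with invented names: a single contiguous extent stands in for the list, and journalling and the actual hook points (drop_inode/delete_inode) are not modelled.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical per-inode preallocation record: one contiguous extent. */
struct prealloc_extent {
    uint64_t start; /* first physical block of the extent */
    uint64_t count; /* blocks set aside for the inode */
    uint64_t used;  /* blocks already handed out to writes */
};

/* Called as data is written: hand out the next preallocated block,
 * so consecutive writes stay contiguous. */
int prealloc_take(struct prealloc_extent *pa, uint64_t *block)
{
    if (pa->used == pa->count)
        return ENOSPC;
    *block = pa->start + pa->used++;
    return 0;
}

/* Called once the range is fully written or on last close: give back
 * the unused tail. Returns the number of surplus blocks and stores
 * the first one in *first_free. */
uint64_t prealloc_release(struct prealloc_extent *pa, uint64_t *first_free)
{
    uint64_t surplus = pa->count - pa->used;

    *first_free = pa->start + pa->used;
    pa->count = pa->used;
    return surplus;
}
```

Because only the tail is ever returned, the blocks that were used remain one contiguous run, which is how this variant avoids the fragmentation problem of the previous alternative.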
--
Anton Altaparmakov aia21 at cam.ac.uk (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/




Re: [RFC] Heads up on sys_fallocate()

2007-03-04 Thread Anton Altaparmakov


On 5 Mar 2007, at 00:16, Jörn Engel wrote:
> On Sun, 4 March 2007 14:38:13 -0800, Ulrich Drepper wrote:
>> When you do it like this, who can the kernel/filesystem *guarantee* that
>> when the data is written there actually is room on the harddrive?
>>
>> What you described seems like using truncate/ftruncate to increase the
>> file's size.  That is not at all what posix_fallocate is for.
>> posix_fallocate must make sure that the requested blocks on the disk are
>> reserved (allocated) for the file's use and that at no point in the
>> future will, say, a msync() fail because a mmap(MAP_SHARED) page has
>> been written to.
>
> That actually causes an interesting problem for compressing filesystems.
> The space consumed by blocks depends on their contents and how well it
> compresses.  At the moment, the only option I see to support
> posix_fallocate for LogFS is to set an inode flag disabling compression,
> then allocate the blocks.
>
> But if the file already contains large amounts of compressed data, I
> have a problem.  Disabling compression for a range within a file is not
> supported, so I can only return an error.  But which one?


I don't know how your compression algorithm works but at least on  
NTFS that bit is easy: you allocate the blocks and mark them as  
allocated then the compression engine will write non-compressed data  
to those blocks.  Basically it works like this "does compression  
block X have any sparse blocks?". If the answer is "yes" the block is  
treated as compressed data and if the answer is "no" the block is  
treated as uncompressed data.  This means that if the data cannot be  
compressed (and in some cases if the data compressed is bigger than  
the data uncompressed) the data is stored non-compressed.  That is  
the most space efficient method to do things.


An alternative would be to allocate blocks and then when the data is  
written perform the compression and free any blocks you do not need  
any more because the data has shrunk sufficiently.  Depending on the  
implementation details this could potentially create horrible  
fragmentation as you would allocate a large consecutive region and  
then go and drop random blocks from that region thus making the file  
fragmented.


Best regards,

Anton
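
The NTFS rule described in this message ("does compression block X have any sparse blocks?") reduces to a scan over the clusters of one compression block. The sketch below follows that description only and has not been checked against the NTFS on-disk format; SPARSE_CLUSTER and the function name are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical marker for "no physical cluster backs this slot". */
#define SPARSE_CLUSTER (-1)

/* lcn[i] is the physical cluster backing slot i of one compression
 * block, or SPARSE_CLUSTER. Per the description above: any sparse
 * slot means the block holds compressed data; a fully allocated
 * block is read and written as uncompressed data. */
bool compression_block_is_compressed(const int64_t *lcn, size_t nclusters)
{
    for (size_t i = 0; i < nclusters; i++) {
        if (lcn[i] == SPARSE_CLUSTER)
            return true;
    }
    return false;
}
```

Under this rule a preallocated (fully backed) compression block is simply treated as uncompressed, which is why the allocation case is described as easy.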
--
Anton Altaparmakov aia21 at cam.ac.uk (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/




Re: [RFC] Heads up on sys_fallocate()

2007-03-04 Thread Anton Altaparmakov

Hi,

On 4 Mar 2007, at 22:38, Ulrich Drepper wrote:
> Anton Altaparmakov wrote:
>> And that is it.  No zeroing needs to happen at all because we
>> have not updated the initialized size of the inode!
>
> When you do it like this, who can the kernel/filesystem *guarantee* that
> when the data is written there actually is room on the harddrive?

The blocks are allocated so of course it is guaranteed.  Subsequent
writes to this file will not generate any allocations thus
allocations cannot fail.  (-:

> What you described seems like using truncate/ftruncate to increase the
> file's size.  That is not at all what posix_fallocate is for.
> posix_fallocate must make sure that the requested blocks on the disk are
> reserved (allocated) for the file's use and that at no point in the
> future will, say, a msync() fail because a mmap(MAP_SHARED) page has
> been written to.


No that is different.  I described performing the allocations in the  
volume bitmap, i.e. for each allocated block the corresponding "in  
use" bit is set in the bitmap (NTFS uses a linear bitmap where byte  
0, bit 0 == physical block 0 of volume, byte 0, bit 1 == physical  
block 1 of volume, ... byte 1, bit 0 == block 8 of volume, ...).
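
The linear bitmap layout just described (byte 0, bit 0 is block 0; byte 1, bit 0 is block 8) comes down to a divide and a remainder. A minimal sketch with made-up helper names, not taken from the NTFS driver:

```c
#include <stdbool.h>
#include <stdint.h>

/* Mark a physical block as in use: byte = block / 8, bit = block % 8. */
static inline void bitmap_set_block(uint8_t *bitmap, uint64_t block)
{
    bitmap[block / 8] |= (uint8_t)(1u << (block % 8));
}

/* Test whether a physical block is marked as in use. */
static inline bool bitmap_block_in_use(const uint8_t *bitmap, uint64_t block)
{
    return (bitmap[block / 8] >> (block % 8)) & 1u;
}
```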


Also I described updating the extent map of the inode such that it  
describes the physical blocks as belonging to the file, thus you  
would have "logical file block X corresponds to physical block Y on  
volume" entries entered into the extent map of the inode and they  
would describe the just allocated blocks.


Finally I described updating the allocated size in the inode which  
basically says "there are that many bytes worth of blocks allocated  
to this inode".


And optionally I described updating the data size in the inode which  
basically says "this file has size Z bytes".


And I specifically did NOT update the initialized size in the inode  
thus it will remain at its old value thus all new allocated blocks  
will be considered as present but not initialized thus a read will  
always return zero whilst a write will do the right thing and pad  
with zeroes as necessary (if the write is smaller than the block  
size, etc).
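
To make the role of the initialized size concrete, here is a toy read path under stated assumptions: toy_inode and toy_read are invented, the "disk" is a plain byte array holding whatever stale contents the newly allocated blocks happen to have, and the three sizes are the ones named in this message. Bytes beyond the initialized size are returned as zeroes without looking at the disk, which is why no zeroing is needed at allocation time.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical inode model; allocated size is omitted because the
 * read path does not need it. */
struct toy_inode {
    uint64_t data_size;        /* file size as seen by stat */
    uint64_t initialized_size; /* bytes that hold valid data on disk */
    const uint8_t *disk;       /* backing blocks, possibly stale */
};

/* Read up to len bytes at pos; returns the number of bytes read. */
size_t toy_read(const struct toy_inode *ino, uint64_t pos,
                uint8_t *buf, size_t len)
{
    size_t from_disk = 0;

    if (pos >= ino->data_size)
        return 0; /* at or past end of file */
    if (len > ino->data_size - pos)
        len = (size_t)(ino->data_size - pos);

    if (pos < ino->initialized_size) {
        uint64_t avail = ino->initialized_size - pos;

        from_disk = avail < len ? (size_t)avail : len;
        memcpy(buf, ino->disk + pos, from_disk);
    }
    /* Allocated but uninitialized: never expose stale block contents. */
    memset(buf + from_disk, 0, len - from_disk);
    return len;
}
```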


Note that you are right that this is like truncate in NTFS for non- 
sparse enabled inodes/volumes.


But for sparse ones, instead of doing any allocations in the bitmap  
and entering them in the extent map, you would simply add a single  
entry to the extent map that says "X blocks allocated starting at  
logical block Y corresponding to no physical blocks, i.e. they are  
sparse".  You would then also update the allocated size and data size  
as above and now you can even (but do not have to) update the  
initialized size to be equal to the data size as the file can be  
considered fully initialized because it is sparse.  As an  
implementation detail this truncate operation would not modify the  
compressed size of the inode (i.e. the really used on-disk space,  
i.e. what you get from running "du" as that does not change when you  
add sparse blocks) whilst the fallocate described above would update  
the compressed size (if the file is sparse or compressed - there is  
no compressed size in the inode if the inode is not sparse/ 
compressed) because the file now occupies more blocks on disk even if  
they are actually not initialized.


Best regards,

Anton
--
Anton Altaparmakov aia21 at cam.ac.uk (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/




Re: [RFC] Heads up on sys_fallocate()

2007-03-04 Thread Ulrich Drepper
Anton Altaparmakov wrote:
> And that is it.  No zeroing needs to happen at all because we
> have not updated the initialized size of the inode!

When you do it like this, how can the kernel/filesystem *guarantee* that
when the data is written there actually is room on the harddrive?

What you described seems like using truncate/ftruncate to increase the
file's size.  That is not at all what posix_fallocate is for.
posix_fallocate must make sure that the requested blocks on the disk are
reserved (allocated) for the file's use and that at no point in the
future will, say, a msync() fail because a mmap(MAP_SHARED) page has
been written to.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖





Re: [RFC] Heads up on sys_fallocate()

2007-03-04 Thread Arnd Bergmann
On Sunday 04 March 2007, Anton Altaparmakov wrote:
> > A generic_fallocate makes sense to me iff we can do it in the kernel
> > more significantly more efficiently than in glibc, e.g. by using only
> > a single page in page cache instead of one for each page to be  
> > preallocated.
> >
> > If  glibc is smart enough to do an optimal implementation, I fully  
> > agree
> > with you.
> 
> glibc cannot ever be smart enough because a file system driver will  
> always know better and be able to do things in a much more optimized  
> way.

Ok, that's not what I meant. It's obvious that the file system itself
can do better than both VFS and glibc. The question is whether VFS can
be better than glibc on file systems that don't offer their own
implementation of the fallocate operation.

Arnd <><


Re: [RFC] Heads up on sys_fallocate()

2007-03-04 Thread Anton Altaparmakov

On 3 Mar 2007, at 22:45, Arnd Bergmann wrote:
> On Friday 02 March 2007 00:38:19 Christoph Hellwig wrote:
>>> Forgive me if I haven't put enough thought into it, but would it be
>>> useful to create a generic_fallocate() that writes zeroed pages for any
>>> non-existent pages in the range?  I don't know how glibc currently
>>> implements posix_fallocate(), but maybe the kernel could do it more
>>> efficiently, even in generic code.  Maybe we don't care, since the major
>>> file systems can probably do something better in their own code.
>>
>> I'd be more happy to have the write out zeroes loop in glibc.  And
>> glibc needs to have it anyway, for older kernels.
>
> A generic_fallocate makes sense to me iff we can do it in the kernel
> more significantly more efficiently than in glibc, e.g. by using only
> a single page in page cache instead of one for each page to be
> preallocated.
>
> If  glibc is smart enough to do an optimal implementation, I fully agree
> with you.


glibc cannot ever be smart enough because a file system driver will  
always know better and be able to do things in a much more optimized  
way.


For example on NTFS fallocate() only needs to involve the setting of  
a few bits in the volume block allocation bitmap (one bit for each  
logical block being allocated) and update the extent map in the on- 
disk inode to reflect that those blocks are now allocated to the  
inode.  Then it just needs to update the allocated size and  
optionally the data size (if fallocate wants to increase the file  
size rather than just the allocated size).  And that is it.  No  
zeroing needs to happen at all because we have not updated the  
initialized size of the inode!


glibc can only dream of an implementation like this.  (-;

Best regards,

Anton
--
Anton Altaparmakov aia21 at cam.ac.uk (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/




Re: [RFC] Heads up on sys_fallocate()

2007-03-04 Thread Anton Altaparmakov

Hi,

On 4 Mar 2007, at 22:38, Ulrich Drepper wrote:

Anton Altaparmakov wrote:

And that is it.  No zeroing needs to happen at all because we
have not updated the initialized size of the inode!


When you do it like this, who can the kernel/filesystem *guarantee*  
that

when the data is written there actually is room on the harddrive?


The blocks are allocated so of course it is guaranteed.  Subsequent  
writes to this file will not generate any allocations thus  
allocations cannot fail.  (-:



What you described seems like using truncate/ftruncate to increase the
file's size.  That is not at all what posix_fallocate is for.
posix_fallocate must make sure that the requested blocks on the  
disk are

reserved (allocated) for the file's use and that at no point in the
future will, say, a msync() fail because a mmap(MAP_SHARED) page has
been written to.


No that is different.  I described performing the allocations in the  
volume bitmap, i.e. for each allocated block the corresponding in  
use bit is set in the bitmap (NTFS uses a linear bitmap where byte  
0, bit 0 == physical block 0 of volume, byte 0, bit 1 == physical  
block 1 of volume, ... byte 1, bit 0 == block 8 of volume, ...).


Also I described updating the extent map of the inode such that it  
describes the physical blocks as belonging to the file, thus you  
would have logical file block X corresponds to physical block Y on  
volume entries entered into the extent map of the inode and they  
would describe the just allocated blocks.


Finally I described updating the allocated size in the inode which  
basically says there are that many bytes worth of blocks allocated  
to this inode.


And optionally I described updating the data size in the inode which  
basically says this file has size Z bytes.


And I specifically did NOT update the initialized size in the inode  
thus it will remain at its old value thus all new allocated blocks  
will be considered as present but not initialized thus a read will  
always return zero whilst a write will do the right thing and pad  
with zeroes as necessary (if the write is smaller than the block  
size, etc).
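
A rough sketch of the read side of this scheme: bytes below the
initialized size come from disk, bytes between the initialized size and
the data size read back as zero.  The structure and function names are
hypothetical, and the "disk" is just a byte array for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory model of the three sizes discussed above. */
struct sketch_inode {
    uint64_t allocated_size;    /* bytes worth of blocks allocated */
    uint64_t data_size;         /* file size as seen by stat() */
    uint64_t initialized_size;  /* bytes that hold valid on-disk data */
    const uint8_t *disk;        /* stand-in for the allocated blocks */
};

/* Read up to len bytes at pos; returns the number of bytes produced. */
static size_t sketch_read(const struct sketch_inode *ino, uint64_t pos,
                          uint8_t *buf, size_t len)
{
    size_t from_disk = 0;

    if (pos >= ino->data_size)
        return 0;                               /* at or past EOF */
    if (len > ino->data_size - pos)
        len = (size_t)(ino->data_size - pos);   /* clip to EOF */

    if (pos < ino->initialized_size) {
        uint64_t avail = ino->initialized_size - pos;
        from_disk = avail < len ? (size_t)avail : len;
        memcpy(buf, ino->disk + pos, from_disk);
    }
    /* Allocated but uninitialized: never expose stale block contents. */
    memset(buf + from_disk, 0, len - from_disk);
    return len;
}
```

The point of the design is that stale contents of freshly allocated blocks
are never visible, without the blocks having to be zeroed at allocation time.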


Note that you are right that this is like truncate in NTFS for
non-sparse enabled inodes/volumes.


But for sparse ones, instead of doing any allocations in the bitmap
and entering them in the extent map, you would simply add a single
entry to the extent map that says "X blocks allocated starting at
logical block Y corresponding to no physical blocks", i.e. they are
sparse.  You would then also update the allocated size and data size
as above and now you can even (but do not have to) update the
initialized size to be equal to the data size as the file can be
considered fully initialized because it is sparse.

As an implementation detail this truncate operation would not modify
the compressed size of the inode (i.e. the really used on-disk space,
i.e. what you get from running du, as that does not change when you
add sparse blocks), whilst the fallocate described above would update
the compressed size because the file now occupies more blocks on disk
even if they are actually not initialized.  (This applies only if the
file is sparse or compressed; there is no compressed size in the
inode otherwise.)


Best regards,

Anton
--
Anton Altaparmakov aia21 at cam.ac.uk (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] Heads up on sys_fallocate()

2007-03-04 Thread Anton Altaparmakov


On 5 Mar 2007, at 00:16, Jörn Engel wrote:
> On Sun, 4 March 2007 14:38:13 -0800, Ulrich Drepper wrote:
>> When you do it like this, how can the kernel/filesystem *guarantee* that
>> when the data is written there actually is room on the harddrive?
>>
>> What you described seems like using truncate/ftruncate to increase the
>> file's size.  That is not at all what posix_fallocate is for.
>> posix_fallocate must make sure that the requested blocks on the disk are
>> reserved (allocated) for the file's use and that at no point in the
>> future will, say, a msync() fail because a mmap(MAP_SHARED) page has
>> been written to.
>
> That actually causes an interesting problem for compressing filesystems.
> The space consumed by blocks depends on their contents and how well it
> compresses.  At the moment, the only option I see to support
> posix_fallocate for LogFS is to set an inode flag disabling compression,
> then allocate the blocks.
>
> But if the file already contains large amounts of compressed data, I
> have a problem.  Disabling compression for a range within a file is not
> supported, so I can only return an error.  But which one?


I don't know how your compression algorithm works but at least on
NTFS that bit is easy: you allocate the blocks and mark them as
allocated, then the compression engine will write non-compressed data
to those blocks.  Basically it works like this: "does compression
block X have any sparse blocks?"  If the answer is yes the block is
treated as compressed data and if the answer is no the block is
treated as uncompressed data.  This means that if the data cannot be
compressed (and in some cases if the compressed data is bigger than
the uncompressed data) the data is stored non-compressed.  That is
the most space efficient method to do things.
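
The rule above can be sketched as a small predicate.  This is only an
illustration under the assumption that a compression block is represented
as an array of physical block numbers with a sentinel for sparse entries;
the names and the sentinel value are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SPARSE (-1)  /* hypothetical marker: no physical block */

/*
 * A compression block is treated as compressed data iff at least one of
 * its blocks is sparse; a fully allocated one holds uncompressed data.
 */
static bool sketch_cb_is_compressed(const int64_t *phys, size_t nblocks)
{
    for (size_t i = 0; i < nblocks; i++)
        if (phys[i] == SKETCH_SPARSE)
            return true;
    return false;
}
```

A preallocated range therefore has every block allocated and is simply
read and written as uncompressed data.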


An alternative would be to allocate blocks and then when the data is  
written perform the compression and free any blocks you do not need  
any more because the data has shrunk sufficiently.  Depending on the  
implementation details this could potentially create horrible  
fragmentation as you would allocate a large consecutive region and  
then go and drop random blocks from that region thus making the file  
fragmented.


Best regards,

Anton
--
Anton Altaparmakov aia21 at cam.ac.uk (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/




Re: [RFC] Heads up on sys_fallocate()

2007-03-04 Thread Anton Altaparmakov

On 5 Mar 2007, at 00:32, Anton Altaparmakov wrote:

> On 5 Mar 2007, at 00:16, Jörn Engel wrote:
>> On Sun, 4 March 2007 14:38:13 -0800, Ulrich Drepper wrote:
>>> When you do it like this, how can the kernel/filesystem *guarantee* that
>>> when the data is written there actually is room on the harddrive?
>>>
>>> What you described seems like using truncate/ftruncate to increase the
>>> file's size.  That is not at all what posix_fallocate is for.
>>> posix_fallocate must make sure that the requested blocks on the disk are
>>> reserved (allocated) for the file's use and that at no point in the
>>> future will, say, a msync() fail because a mmap(MAP_SHARED) page has
>>> been written to.
>>
>> That actually causes an interesting problem for compressing filesystems.
>> The space consumed by blocks depends on their contents and how well it
>> compresses.  At the moment, the only option I see to support
>> posix_fallocate for LogFS is to set an inode flag disabling compression,
>> then allocate the blocks.
>>
>> But if the file already contains large amounts of compressed data, I
>> have a problem.  Disabling compression for a range within a file is not
>> supported, so I can only return an error.  But which one?
>
> I don't know how your compression algorithm works but at least on
> NTFS that bit is easy: you allocate the blocks and mark them as
> allocated, then the compression engine will write non-compressed
> data to those blocks.  Basically it works like this: "does
> compression block X have any sparse blocks?"  If the answer is
> yes the block is treated as compressed data and if the answer is
> no the block is treated as uncompressed data.  This means that if
> the data cannot be compressed (and in some cases if the compressed
> data is bigger than the uncompressed data) the data is stored
> non-compressed.  That is the most space efficient method to do things.
>
> An alternative would be to allocate blocks and then when the data
> is written perform the compression and free any blocks you do not
> need any more because the data has shrunk sufficiently.  Depending
> on the implementation details this could potentially create
> horrible fragmentation as you would allocate a large consecutive
> region and then go and drop random blocks from that region thus
> making the file fragmented.


And another thing you could do (best if you support journalling)  
would be to do the allocation and hang the details off the inode on a  
preallocation list of some kind and then as the data gets written  
use blocks from the preallocation list as you go along.  This would  
avoid the fragmentation issue for example.  You could then free the  
surplus blocks when the whole range of the file being covered by the  
preallocation list has been written to and/or when the file is closed  
for the last time (drop_inode/delete_inode).


Best regards,

Anton
--
Anton Altaparmakov aia21 at cam.ac.uk (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/




Re: [RFC] Heads up on sys_fallocate()

2007-03-04 Thread Arnd Bergmann
On Monday 05 March 2007, Jörn Engel wrote:
> That actually causes an interesting problem for compressing filesystems.
> The space consumed by blocks depends on their contents and how well it
> compresses.  At the moment, the only option I see to support
> posix_fallocate for LogFS is to set an inode flag disabling compression,
> then allocate the blocks.
>
> But if the file already contains large amounts of compressed data, I
> have a problem.  Disabling compression for a range within a file is not
> supported, so I can only return an error.  But which one?

Using the current glibc implementation on a compressed file system ideally
should be a very expensive no-op because you won't actually allocate much
space for a file when writing zeroes to it. You also don't benefit from a
contiguous allocation in logfs, since flash has uniform seek times over
the whole medium.

I'd suggest you implement posix_fallocate as a real no-op and just return
success without doing anything. You could also return ENOSPC in case
the blocks requested by posix_fallocate don't fit on the medium without
compression, but that is more or less just guesswork (like statfs is).

Arnd 


Re: [RFC] Heads up on sys_fallocate()

2007-03-04 Thread Arnd Bergmann
On Monday 05 March 2007, Anton Altaparmakov wrote:
> An alternative would be to allocate blocks and then when the data is
> written perform the compression and free any blocks you do not need
> any more because the data has shrunk sufficiently.  Depending on the
> implementation details this could potentially create horrible
> fragmentation as you would allocate a large consecutive region and
> then go and drop random blocks from that region thus making the file
> fragmented.

Unfortunately, this is not as easy on logfs, because there is no point
in allocating a block when there is no data to write into it. Fragmentation
on flash media is free, but you can never modify a block in place without
erasing it first. This means it will always be written to a new location
on the next write access.

One option that might work (similar to what you describe in your other mail)
is to have a per-inode count of reserved blocks, without allocating specific
blocks for them. The journal then needs to maintain the number of total
reserved blocks for all files and keep that in sync with blocks that were
reserved for specific inodes.
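
A minimal sketch of such counting-only reservations, assuming one global
total that must stay in sync with the per-inode counts.  All names here are
hypothetical, this is not LogFS code, and it ignores both journalling and
the fact that compressed data may need fewer blocks than were reserved.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical accounting state. */
struct sketch_fs {
    uint64_t free_blocks;       /* blocks not currently holding data */
    uint64_t reserved_total;    /* sum of all per-inode reservations */
};

struct sketch_resv_inode {
    uint64_t reserved;          /* blocks promised to this inode */
};

/* Promise n blocks to an inode without picking any locations. */
static int sketch_reserve(struct sketch_fs *fs, struct sketch_resv_inode *ino,
                          uint64_t n)
{
    if (fs->free_blocks - fs->reserved_total < n)
        return -ENOSPC;
    fs->reserved_total += n;
    ino->reserved += n;
    return 0;
}

/* Writing one block consumes a reservation if the inode holds one. */
static int sketch_write_block(struct sketch_fs *fs, struct sketch_resv_inode *ino)
{
    if (ino->reserved > 0) {
        ino->reserved--;
        fs->reserved_total--;
    } else if (fs->free_blocks == fs->reserved_total) {
        return -ENOSPC;         /* all remaining space is promised elsewhere */
    }
    fs->free_blocks--;
    return 0;
}
```

The invariant is that reserved_total never exceeds free_blocks, so a writer
holding a reservation can always find room somewhere on the medium.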

Arnd 


Re: [RFC] Heads up on sys_fallocate()

2007-03-04 Thread Jörn Engel
On Sun, 4 March 2007 14:38:13 -0800, Ulrich Drepper wrote:
 
> When you do it like this, how can the kernel/filesystem *guarantee* that
> when the data is written there actually is room on the harddrive?
>
> What you described seems like using truncate/ftruncate to increase the
> file's size.  That is not at all what posix_fallocate is for.
> posix_fallocate must make sure that the requested blocks on the disk are
> reserved (allocated) for the file's use and that at no point in the
> future will, say, a msync() fail because a mmap(MAP_SHARED) page has
> been written to.

That actually causes an interesting problem for compressing filesystems.
The space consumed by blocks depends on their contents and how well it
compresses.  At the moment, the only option I see to support
posix_fallocate for LogFS is to set an inode flag disabling compression,
then allocate the blocks.

But if the file already contains large amounts of compressed data, I
have a problem.  Disabling compression for a range within a file is not
supported, so I can only return an error.  But which one?

Jörn

-- 
A surrounded army must be given a way out.
-- Sun Tzu


Re: [RFC] Heads up on sys_fallocate()

2007-03-04 Thread Christoph Hellwig
On Sun, Mar 04, 2007 at 08:11:17PM +, Anton Altaparmakov wrote:
> glibc cannot ever be smart enough because a file system driver will
> always know better and be able to do things in a much more optimized
> way.

Please read the thread again.  That is not what anyone proposed.
The issue we're discussing is whether the fallback for a filesystem that
does not support preallocation natively should be done in kernelspace
or in userspace.



Re: [RFC] Heads up on sys_fallocate()

2007-03-03 Thread Arnd Bergmann
On Friday 02 March 2007 00:38:19 Christoph Hellwig wrote:
> > Forgive me if I haven't put enough thought into it, but would it be
> > useful to create a generic_fallocate() that writes zeroed pages for any
> > non-existent pages in the range?  I don't know how glibc currently
> > implements posix_fallocate(), but maybe the kernel could do it more
> > efficiently, even in generic code.  Maybe we don't care, since the major
> > file systems can probably do something better in their own code.
>
> I'd be more happy to have the write out zeroes loop in glibc.  And
> glibc needs to have it anyway, for older kernels.

A generic_fallocate makes sense to me iff we can do it in the kernel
significantly more efficiently than in glibc, e.g. by using only
a single page in page cache instead of one for each page to be preallocated.

If glibc is smart enough to do an optimal implementation, I fully agree
with you.
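
For reference, a userspace "write out zeroes" loop might look roughly like
the sketch below.  This is not glibc's actual implementation; the function
name is hypothetical.  It reuses one zero buffer for every write and only
touches bytes past the current end of file, so existing data is never
overwritten.  Holes inside the existing file are not filled, and overflow
of offset + len is not checked.

```c
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 0 on success or an error number, like posix_fallocate(). */
static int sketch_zero_extend(int fd, off_t offset, off_t len)
{
    static const char zeroes[4096];     /* one shared buffer of zeroes */
    struct stat st;
    off_t pos, end;

    if (offset < 0 || len <= 0)
        return EINVAL;
    if (fstat(fd, &st) != 0)
        return errno;

    end = offset + len;
    /* Start at EOF or at offset, whichever is later. */
    pos = st.st_size > offset ? st.st_size : offset;

    while (pos < end) {
        size_t chunk = sizeof(zeroes);
        ssize_t n;

        if (end - pos < (off_t)chunk)
            chunk = (size_t)(end - pos);
        n = pwrite(fd, zeroes, chunk, pos);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return errno;               /* e.g. ENOSPC */
        }
        pos += n;
    }
    return 0;
}
```

Even with a single source buffer in userspace, every written page still
ends up in the page cache, which is the inefficiency a kernel-side helper
could in principle avoid.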

Arnd <><


Re: [RFC] Heads up on sys_fallocate()

2007-03-02 Thread Mingming Cao

Andrew Morton wrote:
> On Fri, 02 Mar 2007 09:40:54 +1100
> Nathan Scott <[EMAIL PROTECTED]> wrote:
>
>> On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote:
>>> On Fri, 2 Mar 2007 00:04:45 +0530
>>> "Amit K. Arora" <[EMAIL PROTECTED]> wrote:
>>>
>>>> This is to give a heads up on few patches that we will be soon coming up
>>>> with. These patches implement a new system call sys_fallocate() and a
>>>> new inode operation "fallocate", for persistent preallocation. The new
>>>> system call, as Andrew suggested, will look like:
>>>>
>>>>  asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);
>>>
>>> ...
>>>
>>> I'd agree with Eric on the "command" flag extension.
>>
>> Seems like a separate syscall would be better, "command" sounds
>> a bit ioctl like, especially if that command is passed into the
>> filesystems..
>
> madvise, fadvise, lseek, etc seem to work OK.
>
> I get repeatedly traumatised by patch rejects whenever a new syscall gets
> added, so I'm biased.
>
> The advantage of a command flag is that we can add new modes in the future
> without causing lots of churn, waiting for arch maintainers to catch up,
> potentially adding new compat code, etc.
>
> Rename it to "mode"? ;)

I am wondering if it is useful to add another mode to advise block
allocation policy? Something like indicating which physical block/block
group to allocate from (goal), and whether to ask for strictly contiguous
blocks. This will help preallocation or reservation to choose the right
blocks for the file.

Right now neither the ext4 preallocation implementation nor reservation is
guaranteed to allocate/reserve contiguous extents. If the application
told it so, it could do more searching to satisfy the requirement.

Or is fadvise the right interface?

Mingming

> I'm inclined to merge this patch nice and early, so the syscall number is
> stabilised.  Otherwise the people who are working on out-of-tree code (ie:
> ext4) will have to keep playing catchup.






Re: [RFC] Heads up on sys_fallocate()

2007-03-02 Thread Mingming Cao

Dave Kleikamp wrote:

On Thu, 2007-03-01 at 14:59 -0800, Andrew Morton wrote:


On Thu, 01 Mar 2007 22:44:16 +
Dave Kleikamp <[EMAIL PROTECTED]> wrote:



On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote:


On Fri, 2 Mar 2007 00:04:45 +0530
"Amit K. Arora" <[EMAIL PROTECTED]> wrote:



+asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len)
+{
+        struct file *file;
+        struct inode *inode;
+        long ret = -EINVAL;
+        file = fget(fd);
+        if (!file)
+                goto out;
+        inode = file->f_path.dentry->d_inode;
+        if (inode->i_op && inode->i_op->fallocate)
+                ret = inode->i_op->fallocate(inode, offset, len);
+        else
+                ret = -ENOTTY;
+        fput(file);
+out:
+        return ret;
+}


ENOTTY is a bit unconventional - we often use EINVAL for this sort of
thing.  But EINVAL has other meanings for posix_fallocate() and isn't
really appropriate here anyway.  So I'm not sure what would be better...


Would EINVAL (or whatever) make it back to the caller of
posix_fallocate(), or would glibc fall back to its current
implementation?

Forgive me if I haven't put enough thought into it, but would it be
useful to create a generic_fallocate() that writes zeroed pages for any
non-existent pages in the range?  I don't know how glibc currently
implements posix_fallocate(), but maybe the kernel could do it more
efficiently, even in generic code.  Maybe we don't care, since the major
file systems can probably do something better in their own code.


Given that glibc already implements fallocate for all filesystems, it will
need to continue to do so for filesystems which don't implement this
syscall - otherwise applications would start breaking.



I didn't make it clear, but my point was to call generic_fallocate if
the file system did not define i_op->fallocate().

if (inode->i_op && inode->i_op->fallocate)
        ret = inode->i_op->fallocate(inode, offset, len);
else
        ret = generic_fallocate(inode, offset, len);

I'm not sure it's worth the effort, but I thought I'd throw the idea out
there.
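
The dispatch idea can be modelled outside the kernel as below.  This is
only an illustration of "use the filesystem's operation if it has one,
otherwise fall back to a generic helper"; the structure, the counters and
all function names are hypothetical stand-ins, not kernel interfaces.

```c
#include <stddef.h>

/* Hypothetical stand-in for the inode operations table. */
struct sketch_inode_ops {
    long (*fallocate)(long long offset, long long len);
};

static int sketch_native_calls;     /* counts native preallocations */
static int sketch_generic_calls;    /* counts generic fallbacks */

/* Pretend filesystem-specific preallocation. */
static long sketch_native_fallocate(long long offset, long long len)
{
    (void)offset;
    (void)len;
    sketch_native_calls++;
    return 0;
}

/* Pretend generic fallback (e.g. zero-filling the range). */
static long sketch_generic_fallocate(long long offset, long long len)
{
    (void)offset;
    (void)len;
    sketch_generic_calls++;
    return 0;
}

/* Prefer the filesystem's own method; otherwise use the generic one. */
static long sketch_do_fallocate(const struct sketch_inode_ops *ops,
                                long long offset, long long len)
{
    if (ops && ops->fallocate)
        return ops->fallocate(offset, len);
    return sketch_generic_fallocate(offset, len);
}
```

With this shape the caller never sees a "not supported" error; whether
that is desirable is exactly what is being debated in this thread.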


I think this is useful.

Mingming



Re: [RFC] Heads up on sys_fallocate()

2007-03-02 Thread Eric Sandeen

Badari Pulavarty wrote:
> BTW, what is the interface for finding out what is the size of the
> pre-allocated file?


With XFS at least, "du," "stat," etc tell you a little:

[EMAIL PROTECTED] test]# touch resvsp
[EMAIL PROTECTED] test]# xfs_io resvsp
xfs_io> resvsp 0 10g

The file is 0 length, but is using 10g of blocks:
(with posix_fallocate this would move the size out to 10g as well)

[EMAIL PROTECTED] test]# ls -lh resvsp
-rw-r--r--  1 root root 0 Nov 28 14:11 resvsp
[EMAIL PROTECTED] test]# du -hc resvsp
10G resvsp
10G total
[EMAIL PROTECTED] test]# stat resvsp
  File: `resvsp'
  Size: 0   Blocks: 20971520   IO Block: 4096   regular empty file
Device: 81eh/2078d  Inode: 186 Links: 1
Access: (0644/-rw-r--r--)  Uid: (0/root)   Gid: (0/root)
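
The same numbers can be read programmatically from struct stat.  A small
sketch, assuming Linux semantics where st_blocks counts 512-byte units;
the helper name is hypothetical.

```c
#include <sys/stat.h>
#include <sys/types.h>

/* Bytes of disk space backing the file, as reported by stat(). */
static long long sketch_allocated_bytes(const struct stat *st)
{
    return (long long)st->st_blocks * 512;
}
```

For the file above, 20971520 blocks * 512 bytes is 10 GiB of allocated
space while st_size is still 0.  Note that a small mismatch between the two
is normal for any file because allocation is rounded up to whole blocks.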

xfs also has an interface to find out what allocations are where:

if you reserve some ranges not starting at 0...

[EMAIL PROTECTED] test]# xfs_io resvsp
xfs_io> resvsp 1g 1g
xfs_io> resvsp 3g 1g
xfs_io> resvsp 5g 1g
xfs_io> quit

[EMAIL PROTECTED] test]# xfs_bmap -v resvsp
resvsp:
 EXT: FILE-OFFSET           BLOCK-RANGE        AG AG-OFFSET            TOTAL   FLAGS
   0: [0..2097151]:         hole                                       2097152
   1: [2097152..4194303]:   42392..2139543      0 (42392..2139543)     2097152 1
   2: [4194304..6291455]:   hole                                       2097152
   3: [6291456..8388607]:   4236696..6333847    0 (4236696..6333847)   2097152 1
   4: [8388608..10485759]:  hole                                       2097152
   5: [10485760..12582911]: 8431000..10528151   0 (8431000..10528151)  2097152 1


The flags of 1 mean that these extents are preallocated/unwritten.
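
To relate the xfs_bmap output to the resvsp commands: the offsets and
totals are in units of 512-byte blocks.  A tiny conversion sketch (helper
names are hypothetical):

```c
#include <stdint.h>

#define SKETCH_BB_SIZE 512ULL   /* xfs_bmap reports 512-byte units */

/* Byte offset at which an extent starts. */
static uint64_t sketch_extent_offset_bytes(uint64_t first)
{
    return first * SKETCH_BB_SIZE;
}

/* Length in bytes of the inclusive range [first..last]. */
static uint64_t sketch_extent_length_bytes(uint64_t first, uint64_t last)
{
    return (last - first + 1) * SKETCH_BB_SIZE;
}
```

Extent 1, [2097152..4194303], therefore starts at 1 GiB and is 1 GiB long,
which is the "resvsp 1g 1g" request.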

I suppose outside of XFS, FIBMAP is your best bet, but that won't tell 
you what is preallocated vs. allocated/written


-Eric


Re: [RFC] Heads up on sys_fallocate()

2007-03-02 Thread Andrew Morton
On Fri, 02 Mar 2007 08:13:00 -0800 Badari Pulavarty <[EMAIL PROTECTED]> wrote:

> > 
> > > What about 
> > > if the
> > > blocks already exists ? What would be return values in those cases ?
> > 
> > 0 on success, other normal errors oetherwise..
> > 
> > If asked for a range that includes already-allocated blocks, you just 
> > allocate any non-allocated blocks in the range, I think.
> 
> Yes. What I was trying to figure out is, if there is a requirement that
> interface need to return exact number of bytes it *really* allocated
> (like write() or read()). I can't think of any, but just wanted to
> through it out..

Hopefully not, because posix didn't anticipate that.

We could of course return a positive number on success, but it'd get
tricky on 32-bit machines.
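
The 32-bit problem in a nutshell: the length argument is a 64-bit loff_t,
but on 32-bit Linux a long return value is only 32 bits wide.  A minimal
sketch (the helper name is hypothetical) of which byte counts could be
returned intact:

```c
#include <stdint.h>

/* Would this byte count survive a trip through a 32-bit signed long? */
static int sketch_fits_32bit_long(int64_t bytes)
{
    return bytes >= INT32_MIN && bytes <= INT32_MAX;
}
```

A 1 GiB allocation fits, but a 10 GiB one like the XFS example elsewhere in
this thread would not, so "bytes allocated" cannot be the return value
without extra plumbing.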

> BTW, what is the interface for finding out what is the size of the
> pre-allocated file ? 

stat.st_blocks?


Re: [RFC] Heads up on sys_fallocate()

2007-03-02 Thread Badari Pulavarty
On Fri, 2007-03-02 at 09:16 -0600, Eric Sandeen wrote:
> Badari Pulavarty wrote:
> > 
> > Amit K. Arora wrote:
> > 
> >> This is to give a heads up on few patches that we will be soon coming up
> >> with. These patches implement a new system call sys_fallocate() and a
> >> new inode operation "fallocate", for persistent preallocation. The new
> >> system call, as Andrew suggested, will look like:
> >>
> >>  asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);
> >>
> > I am wondering about return values from this syscall ? Is it supposed to 
> > return the
> > number of bytes allocated ? What about partial allocations ? 
> 
> If you don't have enough blocks to cover the request, you should 
> probably just return -ENOSPC, not a partial allocation.

That could be challenging when multiple writers are working in
parallel. You may not be able to return -ENOSPC until you fail the
allocation (for filesystems which allocate a block at a time).

> 
> > What about 
> > if the
> > blocks already exists ? What would be return values in those cases ?
> 
> 0 on success, other normal errors oetherwise..
> 
> If asked for a range that includes already-allocated blocks, you just 
> allocate any non-allocated blocks in the range, I think.

Yes. What I was trying to figure out is whether there is a requirement that
the interface needs to return the exact number of bytes it *really* allocated
(like write() or read()). I can't think of any, but just wanted to
throw it out..

BTW, what is the interface for finding out what is the size of the
pre-allocated file ? 

Thanks,
Badari



Re: [RFC] Heads up on sys_fallocate()

2007-03-02 Thread Eric Sandeen

Badari Pulavarty wrote:
> Amit K. Arora wrote:
>> This is to give a heads up on few patches that we will be soon coming up
>> with. These patches implement a new system call sys_fallocate() and a
>> new inode operation "fallocate", for persistent preallocation. The new
>> system call, as Andrew suggested, will look like:
>>
>>  asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);
>
> I am wondering about return values from this syscall? Is it supposed to
> return the number of bytes allocated? What about partial allocations?

If you don't have enough blocks to cover the request, you should
probably just return -ENOSPC, not a partial allocation.

> What about if the blocks already exist? What would be return values
> in those cases?

0 on success, other normal errors otherwise.

If asked for a range that includes already-allocated blocks, you just
allocate any non-allocated blocks in the range, I think.


-Eric



Re: [RFC] Heads up on sys_fallocate()

2007-03-02 Thread Ulrich Drepper

On 3/2/07, Dave Kleikamp <[EMAIL PROTECTED]> wrote:
> Then there's no need for sys_fallocate to return a long.


Every syscall must return a long.  Otherwise you can have problems on
64-bit archs.


Re: [RFC] Heads up on sys_fallocate()

2007-03-02 Thread Jan Engelhardt

On Mar 1 2007 23:09, Dave Kleikamp wrote:
>> 
>> Given that glibc already implements fallocate for all filesystems, it will
>> need to continue to do so for filesystems which don't implement this
>> syscall - otherwise applications would start breaking.
>
>I didn't make it clear, but my point was to call generic_fallocate if
>the file system did not define i_op->allocate().
>
>if (inode->i_op && inode->i_op->fallocate)
>   ret = inode->i_op->fallocate(inode, offset, len);
>else
>   ret = generic_fallocate(inode, offset, len);
>
>I'm not sure it's worth the effort, but I thought I'd throw the idea out
>there.

Writing zeroes using glibc emu most likely means write() --
so generic_fallocate should be preferable (think splice).
Or does glibc use mmap() and it's all different?


Jan
-- 


Re: [RFC] Heads up on sys_fallocate()

2007-03-02 Thread Dave Kleikamp
Amit wrote:

>  asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);

On Thu, 2007-03-01 at 22:16 -0800, Andrew Morton wrote:
> On Thu, 01 Mar 2007 22:03:55 -0800 Badari Pulavarty <[EMAIL PROTECTED]> wrote:
> 
> > Just curious .. What does posix_fallocate() return ?
> 
> bookmark this:
> 
> http://www.opengroup.org/onlinepubs/009695399/nfindex.html
> 
> Upon successful completion, posix_fallocate() shall return zero;
> otherwise, an error number shall be returned to indicate the error.

Then there's no need for sys_fallocate to return a long.
-- 
David Kleikamp
IBM Linux Technology Center



Re: [RFC] Heads up on sys_fallocate()

2007-03-02 Thread Dave Kleikamp
On Fri, 2007-03-02 at 18:45 +0800, Andreas Dilger wrote:
> On Mar 01, 2007  13:15 -0600, Eric Sandeen wrote:
> > One thing I'd like to see is a cmd argument as well, to allow for 
> > example allocation vs. reservation (i.e. allocating blocks vs. simply 
> > reserving a number), as well as the inverse of those functions 
> > (un-reservation, de-allocation)?
> > 
> > If the allocation interface allows allocation/reservation within 
> > arbitrary ranges, if the only way to un-allocate is via a truncate, 
> > that's pretty asymmetric.
> 
> I'd rather we just get the oft-discussed punch() syscall instead.
> This is really what "unallocate" would do for persistent allocations
> and it would be useful for files that were not preallocated.

I can see a difference though.  punch() would throw away written data as
well as pre-allocated-but-never-written-to data.  I can see where a user
might preallocate a large file and do a lot of random writes.  At some
point, he decides the file isn't going to grow much more, so let's free
up the remaining pre-allocated blocks.  This makes even more sense with
reservation.

The alternative would be to have punch() take a flag to specify if only
preallocated or reserved blocks should be freed.

> 
> For filesystems that don't implement punch glibc() would do zero-filling
> of the punched area I guess (to make it equivalent to reading from a
> hole in the file).

Or it could just fail.  Writing zeroes may be really slow and not give
the caller any benefit.  (The intention was to free blocks back to the
file system.)

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center



Re: [RFC] Heads up on sys_fallocate()

2007-03-02 Thread Andreas Dilger
On Mar 01, 2007  13:15 -0600, Eric Sandeen wrote:
> One thing I'd like to see is a cmd argument as well, to allow for 
> example allocation vs. reservation (i.e. allocating blocks vs. simply 
> reserving a number), as well as the inverse of those functions 
> (un-reservation, de-allocation)?
> 
> If the allocation interface allows allocation/reservation within 
> arbitrary ranges, if the only way to un-allocate is via a truncate, 
> that's pretty asymmetric.

I'd rather we just get the oft-discussed punch() syscall instead.
This is really what "unallocate" would do for persistent allocations
and it would be useful for files that were not preallocated.

For filesystems that don't implement punch(), glibc would do zero-filling
of the punched area I guess (to make it equivalent to reading from a
hole in the file).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



Re: [RFC] Heads up on sys_fallocate()

2007-03-02 Thread Dave Kleikamp
On Fri, 2007-03-02 at 18:45 +0800, Andreas Dilger wrote:
> On Mar 01, 2007  13:15 -0600, Eric Sandeen wrote:
> > One thing I'd like to see is a cmd argument as well, to allow for
> > example allocation vs. reservation (i.e. allocating blocks vs. simply
> > reserving a number), as well as the inverse of those functions
> > (un-reservation, de-allocation)?
> >
> > If the allocation interface allows allocation/reservation within
> > arbitrary ranges, if the only way to un-allocate is via a truncate,
> > that's pretty asymmetric.
>
> I'd rather we just get the oft-discussed punch() syscall instead.
> This is really what "unallocate" would do for persistent allocations
> and it would be useful for files that were not preallocated.

I can see a difference though.  punch() would throw away written data as
well as pre-allocated-but-never-written-to data.  I can see where a user
might preallocate a large file and do a lot of random writes.  At some
point, he decides the file isn't going to grow much more, so let's free
up the remaining pre-allocated blocks.  This makes even more sense with
reservation.

The alternative would be to have punch() take a flag to specify if only
preallocated or reserved blocks should be freed.

 
> For filesystems that don't implement punch, glibc would do zero-filling
> of the punched area I guess (to make it equivalent to reading from a
> hole in the file).

Or it could just fail.  Writing zeroes may be really slow and not give
the caller any benefit.  (The intention was to free blocks back to the
file system.)

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center


Re: [RFC] Heads up on sys_fallocate()

2007-03-02 Thread Dave Kleikamp
Amit wrote:

  asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);

On Thu, 2007-03-01 at 22:16 -0800, Andrew Morton wrote:
> On Thu, 01 Mar 2007 22:03:55 -0800 Badari Pulavarty <[EMAIL PROTECTED]> wrote:
>
> > Just curious .. What does posix_fallocate() return ?
>
> bookmark this:
>
> http://www.opengroup.org/onlinepubs/009695399/nfindex.html
>
> Upon successful completion, posix_fallocate() shall return zero;
> otherwise, an error number shall be returned to indicate the error.

Then there's no need for sys_fallocate to return a long.
-- 
David Kleikamp
IBM Linux Technology Center


Re: [RFC] Heads up on sys_fallocate()

2007-03-02 Thread Jan Engelhardt

On Mar 1 2007 23:09, Dave Kleikamp wrote:
 
> > Given that glibc already implements fallocate for all filesystems, it will
> > need to continue to do so for filesystems which don't implement this
> > syscall - otherwise applications would start breaking.
>
> I didn't make it clear, but my point was to call generic_fallocate if
> the file system did not define i_op->fallocate().
>
>         if (inode->i_op && inode->i_op->fallocate)
>                 ret = inode->i_op->fallocate(inode, offset, len);
>         else
>                 ret = generic_fallocate(inode, offset, len);
>
> I'm not sure it's worth the effort, but I thought I'd throw the idea out
> there.

Writing zeroes using glibc emu most likely means write() --
so generic_fallocate should be preferable (think splice).
Or does glibc use mmap() and it's all different?


Jan


Re: [RFC] Heads up on sys_fallocate()

2007-03-02 Thread Ulrich Drepper

On 3/2/07, Dave Kleikamp <[EMAIL PROTECTED]> wrote:
> Then there's no need for sys_fallocate to return a long.

Every syscall must return a long.  Otherwise you can have problems on
64-bit archs.


Re: [RFC] Heads up on sys_fallocate()

2007-03-02 Thread Eric Sandeen

Badari Pulavarty wrote:
> Amit K. Arora wrote:
>
> This is to give a heads up on few patches that we will be soon coming up
> with. These patches implement a new system call sys_fallocate() and a
> new inode operation "fallocate", for persistent preallocation. The new
> system call, as Andrew suggested, will look like:
>
>  asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);
>
> I am wondering about return values from this syscall ? Is it supposed to
> return the number of bytes allocated ? What about partial allocations ?

If you don't have enough blocks to cover the request, you should
probably just return -ENOSPC, not a partial allocation.

> What about if the blocks already exist ? What would be return values
> in those cases ?

0 on success, other normal errors otherwise.

If asked for a range that includes already-allocated blocks, you just
allocate any non-allocated blocks in the range, I think.


-Eric


Re: [RFC] Heads up on sys_fallocate()

2007-03-02 Thread Badari Pulavarty
On Fri, 2007-03-02 at 09:16 -0600, Eric Sandeen wrote:
> Badari Pulavarty wrote:
> >
> > Amit K. Arora wrote:
> >
> > This is to give a heads up on few patches that we will be soon coming up
> > with. These patches implement a new system call sys_fallocate() and a
> > new inode operation "fallocate", for persistent preallocation. The new
> > system call, as Andrew suggested, will look like:
> >
> >  asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);
> >
> > I am wondering about return values from this syscall ? Is it supposed to
> > return the number of bytes allocated ? What about partial allocations ?
>
> If you don't have enough blocks to cover the request, you should
> probably just return -ENOSPC, not a partial allocation.

That could be challenging when multiple writers are working in
parallel. You may not be able to return -ENOSPC until the allocation
actually fails (for filesystems which allocate a block at a time).

> > What about if the blocks already exist ? What would be return values
> > in those cases ?
>
> 0 on success, other normal errors otherwise.
>
> If asked for a range that includes already-allocated blocks, you just
> allocate any non-allocated blocks in the range, I think.

Yes. What I was trying to figure out is whether there is a requirement
that the interface return the exact number of bytes it *really* allocated
(like write() or read()). I can't think of any, but just wanted to
throw it out.

BTW, what is the interface for finding out the size of the
pre-allocated file ?

Thanks,
Badari


Re: [RFC] Heads up on sys_fallocate()

2007-03-02 Thread Andrew Morton
On Fri, 02 Mar 2007 08:13:00 -0800 Badari Pulavarty <[EMAIL PROTECTED]> wrote:

> > > What about if the blocks already exist ? What would be return values
> > > in those cases ?
> >
> > 0 on success, other normal errors otherwise.
> >
> > If asked for a range that includes already-allocated blocks, you just
> > allocate any non-allocated blocks in the range, I think.
>
> Yes. What I was trying to figure out is whether there is a requirement
> that the interface return the exact number of bytes it *really* allocated
> (like write() or read()). I can't think of any, but just wanted to
> throw it out.

Hopefully not, because posix didn't anticipate that.

We could of course return a positive number on success, but it'd get
tricky on 32-bit machines.

> BTW, what is the interface for finding out what is the size of the
> pre-allocated file ?

stat.st_blocks?
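
A minimal sketch of reading that value from user space, assuming Linux's convention that st_blocks counts 512-byte units. file_usage() is a hypothetical helper name; it reports the logical size and the allocated bytes side by side so the two can be compared.

```c
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: report the logical size (st_size) and the space
 * actually allocated (st_blocks, assumed to be in 512-byte units as on
 * Linux). Returns 0 on success, -1 if fstat() fails. */
int file_usage(int fd, off_t *size, off_t *allocated_bytes)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return -1;
    *size = st.st_size;
    *allocated_bytes = (off_t)st.st_blocks * 512;
    return 0;
}
```

A file whose allocated bytes exceed its size has space reserved beyond EOF; a sparse file shows the opposite. Neither number says which blocks are preallocated and which have been written.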


Re: [RFC] Heads up on sys_fallocate()

2007-03-02 Thread Eric Sandeen

Badari Pulavarty wrote:

> BTW, what is the interface for finding out what is the size of the
> pre-allocated file ?


With XFS at least, du, stat, etc tell you a little:

[EMAIL PROTECTED] test]# touch resvsp
[EMAIL PROTECTED] test]# xfs_io resvsp
xfs_io> resvsp 0 10g

The file is 0 length, but is using 10g of blocks:
(with posix_fallocate this would move the size out to 10g as well)

[EMAIL PROTECTED] test]# ls -lh resvsp
-rw-r--r--  1 root root 0 Nov 28 14:11 resvsp
[EMAIL PROTECTED] test]# du -hc resvsp
10G resvsp
10G total
[EMAIL PROTECTED] test]# stat resvsp
  File: `resvsp'
  Size: 0   Blocks: 20971520   IO Block: 4096   regular empty file

Device: 81eh/2078d  Inode: 186 Links: 1
Access: (0644/-rw-r--r--)  Uid: (0/root)   Gid: (0/root)

XFS also has an interface to find out what allocations are where.
If you reserve some ranges not starting at 0:

[EMAIL PROTECTED] test]# xfs_io resvsp
xfs_io> resvsp 1g 1g
xfs_io> resvsp 3g 1g
xfs_io> resvsp 5g 1g
xfs_io> quit

[EMAIL PROTECTED] test]# xfs_bmap -v resvsp
resvsp:
 EXT: FILE-OFFSET             BLOCK-RANGE        AG AG-OFFSET             TOTAL   FLAGS
   0: [0..2097151]:           hole                                        2097152
   1: [2097152..4194303]:     42392..2139543      0 (42392..2139543)      2097152 1
   2: [4194304..6291455]:     hole                                        2097152
   3: [6291456..8388607]:     4236696..6333847    0 (4236696..6333847)    2097152 1
   4: [8388608..10485759]:    hole                                        2097152
   5: [10485760..12582911]:   8431000..10528151   0 (8431000..10528151)   2097152 1


A flags value of 1 means that the extent is preallocated/unwritten.

I suppose outside of XFS, FIBMAP is your best bet, but that won't tell
you what is preallocated vs. allocated/written.


-Eric


Re: [RFC] Heads up on sys_fallocate()

2007-03-02 Thread Mingming Cao

Dave Kleikamp wrote:
> On Thu, 2007-03-01 at 14:59 -0800, Andrew Morton wrote:
> > On Thu, 01 Mar 2007 22:44:16 +
> > Dave Kleikamp <[EMAIL PROTECTED]> wrote:
> > > On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote:
> > > > On Fri, 2 Mar 2007 00:04:45 +0530
> > > > Amit K. Arora <[EMAIL PROTECTED]> wrote:
> > > > > +asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len)
> > > > > +{
> > > > > +        struct file *file;
> > > > > +        struct inode *inode;
> > > > > +        long ret = -EINVAL;
> > > > > +        file = fget(fd);
> > > > > +        if (!file)
> > > > > +                goto out;
> > > > > +        inode = file->f_path.dentry->d_inode;
> > > > > +        if (inode->i_op && inode->i_op->fallocate)
> > > > > +                ret = inode->i_op->fallocate(inode, offset, len);
> > > > > +        else
> > > > > +                ret = -ENOTTY;
> > > > > +        fput(file);
> > > > > +out:
> > > > > +        return ret;
> > > > > +}
> > > >
> > > > ENOTTY is a bit unconventional - we often use EINVAL for this sort of
> > > > thing.  But EINVAL has other meanings for posix_fallocate() and isn't
> > > > really appropriate here anyway.  So I'm not sure what would be better...
> > >
> > > Would EINVAL (or whatever) make it back to the caller of
> > > posix_fallocate(), or would glibc fall back to its current
> > > implementation?
> > >
> > > Forgive me if I haven't put enough thought into it, but would it be
> > > useful to create a generic_fallocate() that writes zeroed pages for any
> > > non-existent pages in the range?  I don't know how glibc currently
> > > implements posix_fallocate(), but maybe the kernel could do it more
> > > efficiently, even in generic code.  Maybe we don't care, since the major
> > > file systems can probably do something better in their own code.
> >
> > Given that glibc already implements fallocate for all filesystems, it will
> > need to continue to do so for filesystems which don't implement this
> > syscall - otherwise applications would start breaking.
>
> I didn't make it clear, but my point was to call generic_fallocate if
> the file system did not define i_op->fallocate().
>
>         if (inode->i_op && inode->i_op->fallocate)
>                 ret = inode->i_op->fallocate(inode, offset, len);
>         else
>                 ret = generic_fallocate(inode, offset, len);
>
> I'm not sure it's worth the effort, but I thought I'd throw the idea out
> there.


I think this is useful.
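
To make the zero-filling idea concrete, here is a user-space sketch of what such a generic fallback might do. It is not the proposed kernel generic_fallocate() and not glibc's actual code; emulate_fallocate() and touch_block() are hypothetical names. It touches one byte per st_blksize step, which assumes that writing a single byte forces the filesystem to allocate the surrounding block.

```c
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Make sure the block holding byte `pos` is allocated without changing
 * file contents. Returns 0 or an error number. */
static int touch_block(int fd, off_t pos, off_t file_size)
{
    char c = 0;

    if (pos < file_size) {
        ssize_t r = pread(fd, &c, 1, pos);

        if (r < 0)
            return errno;
        if (r == 1 && c != 0)
            return 0;   /* non-zero data: the block is already allocated */
    }
    /* c is zero here, so writing it back never alters what a reader sees. */
    if (pwrite(fd, &c, 1, pos) != 1)
        return errno ? errno : EIO;
    return 0;
}

/* Hypothetical emulation: allocate [offset, offset + len) by touching one
 * byte per block. Returns 0 or an error number, like posix_fallocate().
 * Assumes offset + len does not overflow off_t. */
int emulate_fallocate(int fd, off_t offset, off_t len)
{
    struct stat st;
    off_t step, end, pos;
    int err;

    if (offset < 0 || len <= 0)
        return EINVAL;
    if (fstat(fd, &st) != 0)
        return errno;
    step = st.st_blksize > 0 ? st.st_blksize : 512;
    end = offset + len;

    for (pos = offset; pos < end; pos += step) {
        err = touch_block(fd, pos, st.st_size);
        if (err)
            return err;
    }
    /* Touch the final byte so the last block is covered and the file
     * size grows to offset + len. */
    return touch_block(fd, end - 1, st.st_size);
}
```

The read-then-write step is racy against a concurrent writer, which is one reason an in-kernel implementation could do better than any user-space loop.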

Mingming


Re: [RFC] Heads up on sys_fallocate()

2007-03-02 Thread Mingming Cao

Andrew Morton wrote:
> On Fri, 02 Mar 2007 09:40:54 +1100
> Nathan Scott <[EMAIL PROTECTED]> wrote:
> > On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote:
> > > On Fri, 2 Mar 2007 00:04:45 +0530
> > > Amit K. Arora <[EMAIL PROTECTED]> wrote:
> > > > This is to give a heads up on few patches that we will be soon coming up
> > > > with. These patches implement a new system call sys_fallocate() and a
> > > > new inode operation "fallocate", for persistent preallocation. The new
> > > > system call, as Andrew suggested, will look like:
> > > >
> > > >  asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);
> > >
> > > ...
> > >
> > > I'd agree with Eric on the command flag extension.
> >
> > Seems like a separate syscall would be better, "command" sounds
> > a bit ioctl like, especially if that command is passed into the
> > filesystems..
>
> madvise, fadvise, lseek, etc seem to work OK.
>
> I get repeatedly traumatised by patch rejects whenever a new syscall gets
> added, so I'm biased.
>
> The advantage of a command flag is that we can add new modes in the future
> without causing lots of churn, waiting for arch maintainers to catch up,
> potentially adding new compat code, etc.
>
> Rename it to "mode"? ;)

I am wondering if it would be useful to add another mode to advise block
allocation policy? Something like indicating which physical block/block
group to allocate from (goal), and whether to ask for strictly contiguous
blocks. This would help preallocation or reservation choose the right
blocks for the file.

Right now neither the ext4 preallocation implementation nor reservation is
guaranteed to allocate/reserve contiguous extents. If the application
told it so, it could do more searching to satisfy the requirement.

Or is fadvise the right interface?

Mingming

> I'm inclined to merge this patch nice and early, so the syscall number is
> stabilised.  Otherwise the people who are working on out-of-tree code (ie:
> ext4) will have to keep playing catchup.


Re: [RFC] Heads up on sys_fallocate()

2007-03-01 Thread Ulrich Drepper
Andrew Morton wrote:
> Perhaps Ulrich can comment.

I was out of town, hence the delay.

I think that if there is no support for the syscall the correct answer
is to return ENOSYS.  In this case the current userlevel code would be
used and ENOSYS is also used to trigger the use of the compat code in
glibc in case the syscall does not exist at all.
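
The dispatch described here can be sketched as below. This is an illustration, not glibc code: prealloc_with_fallback() is a hypothetical name, and the two paths are passed in as function pointers so the control flow can be shown without a real syscall. The stub functions exist only to exercise it.

```c
#include <errno.h>
#include <sys/types.h>

/* Both paths return 0 on success or a negative errno, kernel style. */
typedef long (*prealloc_fn)(int fd, off_t offset, off_t len);

/* Hypothetical wrapper: use the kernel path unless it reports ENOSYS,
 * in which case fall back to the user-level emulation. Every other
 * error (ENOSPC, EBADF, ...) is passed straight to the caller. */
long prealloc_with_fallback(prealloc_fn via_syscall, prealloc_fn emulate,
                            int fd, off_t offset, off_t len)
{
    long ret = via_syscall(fd, offset, len);

    if (ret != -ENOSYS)
        return ret;
    return emulate(fd, offset, len);
}

/* Stand-ins for illustration only. */
static int emulate_calls;

long stub_no_support(int fd, off_t offset, off_t len)
{
    (void)fd; (void)offset; (void)len;
    return -ENOSYS;     /* syscall or filesystem support missing */
}

long stub_out_of_space(int fd, off_t offset, off_t len)
{
    (void)fd; (void)offset; (void)len;
    return -ENOSPC;     /* a real failure that must not trigger fallback */
}

long stub_emulate(int fd, off_t offset, off_t len)
{
    (void)fd; (void)offset; (void)len;
    emulate_calls++;
    return 0;
}
```

The result is deliberately not cached: if ENOSYS also means "this filesystem lacks support", as suggested above, the answer can differ from one file to the next.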

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖





Re: [RFC] Heads up on sys_fallocate()

2007-03-01 Thread Andrew Morton
On Thu, 01 Mar 2007 22:03:55 -0800 Badari Pulavarty <[EMAIL PROTECTED]> wrote:

> Just curious .. What does posix_fallocate() return ?

bookmark this:

http://www.opengroup.org/onlinepubs/009695399/nfindex.html

Upon successful completion, posix_fallocate() shall return zero;
otherwise, an error number shall be returned to indicate the error.


Re: [RFC] Heads up on sys_fallocate()

2007-03-01 Thread Badari Pulavarty


Amit K. Arora wrote:

> This is to give a heads up on few patches that we will be soon coming up
> with. These patches implement a new system call sys_fallocate() and a
> new inode operation "fallocate", for persistent preallocation. The new
> system call, as Andrew suggested, will look like:
>
>  asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);

I am wondering about return values from this syscall ? Is it supposed to
return the number of bytes allocated ? What about partial allocations ?
What about if the blocks already exist ? What would be return values in
those cases ?

Just curious .. What does posix_fallocate() return ?

Thanks,
Badari


Re: [RFC] Heads up on sys_fallocate()

2007-03-01 Thread Christoph Hellwig
On Thu, Mar 01, 2007 at 05:29:15PM -0600, Eric Sandeen wrote:
> Amit K. Arora wrote:
> 
> Might want more error checking in there, something like (rough cut)...
> (or is some of this glibc's job?)

Yeah, we need to have these checks.  We can't rely on userspace not
passing arguments that might corrupt your filesystem or let you
escalate privileges.

> which would keep things in line with posix_fallocate's specified errors, 
> too?

Yes, very good idea.
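
As an illustration of the kind of argument checking meant here (this is not the code from the quoted message, which is not reproduced above), the EINVAL and EFBIG cases from the posix_fallocate() error list could be checked as follows. check_fallocate_range() is a hypothetical name and max_bytes stands in for the filesystem's maximum file size, assumed non-negative.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical range check. Returns 0 if the range is acceptable, or a
 * negative errno in kernel style:
 *   -EINVAL  offset is negative or len is not positive
 *   -EFBIG   offset + len would exceed max_bytes
 * The EFBIG test is arranged so that offset + len is never computed,
 * which avoids signed overflow on hostile input. */
int check_fallocate_range(int64_t offset, int64_t len, int64_t max_bytes)
{
    if (offset < 0 || len <= 0)
        return -EINVAL;
    if (len > max_bytes || offset > max_bytes - len)
        return -EFBIG;
    return 0;
}
```

Doing this in the syscall rather than in glibc means every caller gets the same protection, including those that bypass the library.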

