On 09/03/2014 02:13 PM, Dave Chinner wrote:
<>
> 
> When direct IO fails ext4 falls back to buffered IO, right? And
> dax_do_io() can return partial writes, yes?
> 

There is no buffered writes with DAX. .I.E buffered writes are always
direct as well. (No page cache)

> So that means if you get, say, ENOSPC part way through a DAX write,
> ext4 can start dirtying the page cache from
> __generic_file_write_iter() because the DAX write didn't wholly
> complete? And say this ENOSPC races with space being freed from
> another inode, then the buffered write will succeed and we'll end up
> with coherency issues, right?
> 
> This is not an idle question - XFS if firing asserts all over the
> place when doing ENOSPC testing because DAX is returning partial
> writes and the XFS direct IO code is expecting them to either wholly
> complete or wholly fail. I can make the DAX variant do allow partial
> writes, but I'm not going to add a useless fallback to buffered IO
> for XFS when the (fully featured) direct allocation fails.
> 

Right, no fall back. Because a fallback is just a retry, because in any
way DAX assumes there is never a page_cache_page for a written data

> Indeed, I note that in the dax_fault code, any page found in the
> page cache is explicitly removed and released, and the direct mapped
> block replaces that page in the vma. IOWs, this code expects pages
> to be clean as we're only supposed to have regions covered by holes
> using cached pages (dax_load_hole()). 

Exactly, page_cache_page are only/always "regions covered by holes"

Once there is a real block allocated for an offset it will be directly
mapped to the vm without a page_cache_page.

> So if we've done a buffered
> write, we're going to toss out dirty pages the moment there is a
> page fault on the range and map the unmodified backing store in
> instead.
> 

No! There is never "buffered write" with DAX. That is: there is never
a page_cache_page that holds data which will belong to the storage
later. DAX means zero-page-cache

> That just seems wrong. Maybe I've forgotten something, but this
> looks like a wart that we don't need and shouldn't bake into this
> interface as both ext4 and XFS can allocate into holes and extend
> files from from the direct IO interfaces. Of course, correct me if
> I'm wrong about ext4 capabilities...
> 

Yes you have misread the patchset, all writes are always done directly
to bdev->direct_access(..) memory *never* via a copy to page_cache.

Currently The only existence of radix-tree pages is for ZERO pages that
cover holes, which get thrown out as clean or COWed on mkwrite

BTW Matthew: It took me a while to figure out the VFS/VMA api but
I managed to map a single ZERO page to all holes and COW them to
real blocks on mkwrite. It needed a combination of flags but the
main trick is that at mkwrite I do:

        /* our zero page doesn't really hold the correct offset to the file in
         * page->index so vmf->pgoff is incorrect, lets fix that */
        vmf->pgoff = vma->vm_pgoff + (((unsigned long)vmf->virtual_address -
                        vma->vm_start) >> PAGE_SHIFT);
        /* call fault handler to get a real page for writing */
        ret = _xip_file_fault(vma, vmf);
        /* invalidate all other mappings to that location */
        unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT, PAGE_SIZE, 1);

        /* mkwrite must lock the original page and return VM_FAULT_LOCKED */
        if (ret == VM_FAULT_NOPAGE) {
                lock_page(m1fs_zero_page);
                ret = VM_FAULT_LOCKED;
        }
        return ret;

At _xip_file_fault() also called from .fault I do in the case of a hole:
        if (!(vmf->flags & FAULT_FLAG_WRITE)) {
                ...
                block = _find_data_block(inode, vmf->pgoff);
                if (!block) {
                        vmf->page = g_zero_page;
                        err = vm_insert_page(vma,
                                        (unsigned long)vmf->virtual_address,
                                        vmf->page);
                        goto after_insert;
                }
        } else {

Above g_zero_page is my own global zero page, PAGE_ZERO will not work.
_find_data_block() is like your get_buffer but only for the read case,
the write case uses a different _get_block_create().

Please tell me if it is interesting for you? I can try to patch your DAX
patchset to do the same. This can always be done later as an optimization.

> Cheers,
> Dave.
> 

Thanks
Boaz

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Reply via email to