Re: libata .sg_tablesize: why always dividing by 2 ?

2008-02-26 Thread Mark Lord

Benjamin Herrenschmidt wrote:

On Tue, 2008-02-26 at 00:43 -0500, Mark Lord wrote:

I suppose so. I don't remember all of the details, but iirc, it has to
do with crossing 64K boundaries. Some controllers can't handle it.

It's not only the _size_ of the segments, it's their alignment.

The iommu will not keep alignment beyond the page size (and even
then... on powerpc with a 64k base page size, you may still end up with
a 4k aligned result, but let's not go there now).

..

That's just not possible, unless the IOMMU *splits* segments.
And the IOMMU experts here say that it never does that.


It is totally possible, and I know, as I wrote part of the powerpc iommu
code :-)

The iommu code makes no guarantee about preserving the alignment of a
segment, at least not below PAGE_SIZE.

Thus if you pass to dma_map_sg() a 64K aligned 64K segment, you may well
get back a 4K aligned 64K segment.

Enforcing natural alignment in the iommu code only happens for
dma_alloc_coherent (it uses order-N allocations anyway), it doesn't
happen for map_sg. If we were to do that, we would make it very likely
for iommu allocations to fail on machines with small DMA windows.

Ben.

..

That's interesting.  Can you point us to the exact file::lines where
this happens?  It would be good to ensure that this gets fixed.

I'm copying Fujita Tomonori on this thread now -- he's the dude who's
trying to sort out the IOMMU mess.

Thanks


Re: libata .sg_tablesize: why always dividing by 2 ?

2008-02-26 Thread Mark Lord

Mark Lord wrote:

Benjamin Herrenschmidt wrote:

On Tue, 2008-02-26 at 00:43 -0500, Mark Lord wrote:

I suppose so. I don't remember all of the details, but iirc, it has to
do with crossing 64K boundaries. Some controllers can't handle it.

It's not only the _size_ of the segments, it's their alignment.

The iommu will not keep alignment beyond the page size (and even
then... on powerpc with a 64k base page size, you may still end up with
a 4k aligned result, but let's not go there now).

..

That's just not possible, unless the IOMMU *splits* segments.
And the IOMMU experts here say that it never does that.


It is totally possible, and I know, as I wrote part of the powerpc iommu
code :-)

The iommu code makes no guarantee about preserving the alignment of a
segment, at least not below PAGE_SIZE.

Thus if you pass to dma_map_sg() a 64K aligned 64K segment, you may well
get back a 4K aligned 64K segment.

Enforcing natural alignment in the iommu code only happens for
dma_alloc_coherent (it uses order-N allocations anyway), it doesn't
happen for map_sg. If we were to do that, we would make it very likely
for iommu allocations to fail on machines with small DMA windows.

Ben.

..

That's interesting.  Can you point us to the exact file::lines where
this happens?  It would be good to ensure that this gets fixed.

I'm copying Fujita Tomonori on this thread now -- he's the dude who's
trying to sort out the IOMMU mess.

..

Mmm.. looks like ppc is already fixed in mainline,
commit 740c3ce66700640a6e6136ff679b067e92125794

Ben?


Re: libata .sg_tablesize: why always dividing by 2 ?

2008-02-25 Thread Jeff Garzik
As an aside, ISTR tomo-san was working on eliminating the need for the 
/2 by tackling the details on the IOMMU side...


Jeff





Re: libata .sg_tablesize: why always dividing by 2 ?

2008-02-25 Thread Mark Lord

Jeff Garzik wrote:

Mark Lord wrote:

Jeff,

We had a discussion here today about IOMMUs,
and they *never* split sg list entries -- they only ever *merge*.

And this happens only after the block layer has
already done merging while respecting q->seg_boundary_mask.

So worst case, the IOMMU may merge everything, and then in
libata we unmerge them again.  But the end result can never
exceed the max_sg_entries limit enforced by the block layer.


*shrug*  Early experience said otherwise.  The split in foo_fill_sg
and resulting sg_tablesize reduction were both needed to successfully 
transfer data, when Ben H originally did the work.


If Ben H and everyone on the arch side agrees with the above analysis, I 
would be quite happy to remove all those / 2.




This can cost a lot of memory, as using NCQ effectively multiplies
everything by 32..


I recommend dialing down the hyperbole a bit :)

"a lot" in this case is... maybe another page or two per table, if
that.  Compared with everything else going on in the system, with
16-byte S/G entries, S/G table size is really the least of our worries.

..

Well, today each sg table is about a page in size,
and sata_mv has 32 of them per port.
So cutting them in half would save 16 pages per port,
or 64 pages per host controller.

That's a lot for a small system, but maybe not for my 4GB test boxes.
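
For illustration, the arithmetic works out roughly as below; only
MV_MAX_SG_CT (256) and the 16-byte entry size come from this discussion,
the other names and the 4-port figure are assumptions for the sketch.

/*
 * Back-of-the-envelope only -- MV_MAX_SG_CT and the 16-byte entry size
 * are from the thread; ENTRY_SIZE, TAGS_PER_PORT and PORTS_PER_HOST are
 * illustrative names, not sata_mv constants.
 */
#define ENTRY_SIZE      16                      /* bytes per S/G entry */
#define MV_MAX_SG_CT    256                     /* entries per table */
#define TAGS_PER_PORT   32                      /* one table per NCQ tag */
#define PORTS_PER_HOST  4

/* 256 * 16 = 4096 bytes: one page per table, 32 pages per port */
#define BYTES_PER_TABLE (MV_MAX_SG_CT * ENTRY_SIZE)

/* halving MV_MAX_SG_CT saves 16 pages per port ... */
#define PAGES_SAVED_PER_PORT    (TAGS_PER_PORT / 2)

/* ... or 64 pages per 4-port host controller */
#define PAGES_SAVED_PER_HOST    (PAGES_SAVED_PER_PORT * PORTS_PER_HOST)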

If you were truly concerned about memory usage in sata_mv, a more 
effective route is simply reducing MV_MAX_SG_CT to a number closer to 
the average s/g table size -- which is far, far lower than 256 
(currently MV_MAX_SG_CT), or even 128 (MV_MAX_SG_CT/2).


Or moving to a scheme where you allocate (for example) S/G tables with 
32 entries... then allocate on the fly for the rare case where the S/G 
table must be larger...
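
A rough sketch of that two-tier scheme (not sata_mv code; struct
example_port, the field names and the 32-entry size are assumptions):

/*
 * Two-tier S/G table sketch: a small preallocated table covers the
 * common case, with a one-off DMA-able allocation for the rare
 * oversized request.
 */
#include <linux/dma-mapping.h>

#define INLINE_SG_ENTRIES       32
#define SG_ENTRY_SIZE           16      /* bytes per hardware S/G entry */

struct example_port {
        void            *inline_tbl;    /* INLINE_SG_ENTRIES * SG_ENTRY_SIZE */
        dma_addr_t      inline_tbl_dma;
};

static void *get_sg_table(struct device *dev, struct example_port *pp,
                          unsigned int n_elem, dma_addr_t *dma)
{
        if (n_elem <= INLINE_SG_ENTRIES) {
                *dma = pp->inline_tbl_dma;
                return pp->inline_tbl;          /* common case, no allocation */
        }
        /* rare case: allocate a larger table just for this command */
        return dma_alloc_coherent(dev, n_elem * SG_ENTRY_SIZE, dma, GFP_ATOMIC);
}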

..

Oh, absolutely.. that's on my clean-up list once the rest of
the driver becomes stable and mostly done.  But for now, safety and
correctness are far more important in sata_mv.  :)

Cheers


Re: libata .sg_tablesize: why always dividing by 2 ?

2008-02-25 Thread Mark Lord

Jeff Garzik wrote:
As an aside, ISTR tomo-san was working on eliminating the need for the 
/2 by tackling the details on the IOMMU side...

..

Yes, tomo-san just led a nice detailed discussion of it here at LSF'08,
and he agrees that even today it shouldn't affect us that way.

Cheers


Re: libata .sg_tablesize: why always dividing by 2 ?

2008-02-25 Thread Benjamin Herrenschmidt

On Mon, 2008-02-25 at 19:15 -0500, Jeff Garzik wrote:
 Mark Lord wrote:
  Jeff,
  
  We had a discussion here today about IOMMUs,
  and they *never* split sg list entries -- they only ever *merge*.
  
  And this happens only after the block layer has
  already done merging while respecting q->seg_boundary_mask.
  
  So worst case, the IOMMU may merge everything, and then in
  libata we unmerge them again.  But the end result can never
  exceed the max_sg_entries limit enforced by the block layer.
 
 *shrug*  Early experience said otherwise.  The split in foo_fill_sg
 and resulting sg_tablesize reduction were both needed to successfully 
 transfer data, when Ben H originally did the work.
 
 If Ben H and everyone on the arch side agrees with the above analysis, I 
 would be quite happy to remove all those / 2.

The split wasn't done by the iommu. The split was done by the IDE code
itself to handle the stupid 64k crossing thingy. If it's done
differently now, it might be possible to remove it, I haven't looked.

Cheers,
Ben.




Re: libata .sg_tablesize: why always dividing by 2 ?

2008-02-25 Thread Mark Lord

Benjamin Herrenschmidt wrote:

On Mon, 2008-02-25 at 19:15 -0500, Jeff Garzik wrote:

Mark Lord wrote:

Jeff,

We had a discussion here today about IOMMUs,
and they *never* split sg list entries -- they only ever *merge*.

And this happens only after the block layer has
already done merging while respecting q->seg_boundary_mask.

So worst case, the IOMMU may merge everything, and then in
libata we unmerge them again.  But the end result can never
exceed the max_sg_entries limit enforced by the block layer.
*shrug*  Early experience said otherwise.  The split in foo_fill_sg
and resulting sg_tablesize reduction were both needed to successfully 
transfer data, when Ben H originally did the work.


If Ben H and everyone on the arch side agrees with the above analysis, I 
would be quite happy to remove all those / 2.


The split wasn't done by the iommu. The split was done by the IDE code
itself to handle the stupid 64k crossing thingy. If it's done
differently now, it might be possible to remove it, I haven't looked.

..

The block layer uses seg_boundary_mask to ensure that we never have
to split them again in the LLD.

A very long time ago, when I wrote the IDE DMA code, this was not the case.
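
For reference, this is the knob in question as I read the 2008-era tree
(illustrative, not a patch): the LLD's scsi_host_template dma_boundary is
what the SCSI midlayer feeds into the queue's seg_boundary_mask, and the
sg_tablesize value is where the "/ 2" in the subject line lives.

#include <linux/module.h>
#include <linux/libata.h>
#include <scsi/scsi_host.h>

/*
 * Sketch only: a libata LLD declares both the 64K-crossing rule and the
 * halved table size here.
 */
static struct scsi_host_template example_sht = {
        .module         = THIS_MODULE,
        .name           = "example",
        .queuecommand   = ata_scsi_queuecmd,
        .sg_tablesize   = LIBATA_MAX_PRD,       /* i.e. ATA_MAX_PRD / 2 */
        .dma_boundary   = ATA_DMA_BOUNDARY,     /* 0xffff: never cross 64K */
};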

Cheers


Re: libata .sg_tablesize: why always dividing by 2 ?

2008-02-25 Thread Mark Lord

Mark Lord wrote:

Benjamin Herrenschmidt wrote:

..

The split wasn't done by the iommu. The split was done by the IDE code
itself to handle the stupid 64k crossing thingy. If it's done
differently now, it might be possible to remove it, I haven't looked.

..

The block layer uses seg_boundary_mask to ensure that we never have
to split them again in the LLD.

..

James B.  suggests that we stick a WARN_ON() into libata to let us
know if that precondition is violated.  Sounds like an easy thing to do
for a couple of -rc cycles someday.
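
Something along these lines, presumably (a sketch of the suggested check,
not an actual patch; the helper name is made up):

#include <linux/kernel.h>
#include <linux/scatterlist.h>

/*
 * WARN_ON sketch: after dma_map_sg(), no element handed to the LLD
 * should straddle a 64K boundary if the block layer's seg_boundary_mask
 * really survived the IOMMU.
 */
static void warn_on_64k_crossing(struct scatterlist *sgl, unsigned int n_elem)
{
        struct scatterlist *sg;
        unsigned int i;

        for_each_sg(sgl, sg, n_elem, i) {
                dma_addr_t addr = sg_dma_address(sg);
                u32 len = sg_dma_len(sg);

                /* offset within the 64K block plus length must fit in 64K */
                WARN_ON((u64)(addr & 0xffff) + len > 0x10000);
        }
}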

Cheers


Re: libata .sg_tablesize: why always dividing by 2 ?

2008-02-25 Thread Benjamin Herrenschmidt

 The block layer uses seg_boundary_mask to ensure that we never have
 to split them again in the LLD.
 
 A very long time ago, when I wrote the IDE DMA code, this was not the case.

Still not good enough, because the boundaries can change due to the
iommu merging, so the split must be re-done.

Ben.




Re: libata .sg_tablesize: why always dividing by 2 ?

2008-02-25 Thread Benjamin Herrenschmidt

 James B.  suggests that we stick a WARN_ON() into libata to let us
 know if that precondition is violated.  Sounds like an easy thing to do
 for a couple of -rc cycles someday.

If the block layer gives us a 32k block aligned on a 32k boundary
(aligned), we have no guarantee that the iommu will not turn that into
something unaligned crossing a 32k (and thus possibly a 64k) boundary.

On powerpc, the iommu operates on 4k pages and only provides that level
of alignment to dma_map_sg() (dma_alloc_coherent allocations are naturally
aligned, but not map_sg; that would put way too much pressure on the
allocator on machines that have pinhole-sized iommu space).

Cheers,
Ben.




Re: libata .sg_tablesize: why always dividing by 2 ?

2008-02-25 Thread Mark Lord

Benjamin Herrenschmidt wrote:

James B.  suggests that we stick a WARN_ON() into libata to let us
know if that precondition is violated.  Sounds like an easy thing to do
for a couple of -rc cycles someday.


If the block layer gives us a 32k block aligned on a 32k boundary
(aligned), we have no guarantee that the iommu will not turn that into
something unaligned crossing a 32k (and thus possibly a 64k) boundary.

..

Certainly, but never any worse than what the block layer gave originally.

The important note being:  IOMMU only ever *merges*, it never *splits*.

Which means that, by the time we split up any mis-merges again for 64K 
crossings,
we can never have more SG segments than what the block layer originally
fed to the IOMMU stuff.

Or so the IOMMU and SCSI experts here at LSF'08 have assured me,
even after my own skeptical questioning.
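
The re-split being referred to looks roughly like this (loosely modelled
on libata's fill_sg routines; the function name and the flat addrs/lens
arrays are simplifications, not tree code):

#include <linux/kernel.h>
#include <linux/scatterlist.h>

/*
 * Split each mapped segment at 64K boundaries while writing PRD-style
 * entries.  Worst case every segment splits, which is why drivers halve
 * sg_tablesize.
 */
static unsigned int fill_prd_split_64k(struct scatterlist *sgl,
                                       unsigned int n_elem,
                                       u32 *addrs, u32 *lens)
{
        struct scatterlist *sg;
        unsigned int i, nprd = 0;

        for_each_sg(sgl, sg, n_elem, i) {
                dma_addr_t addr = sg_dma_address(sg);
                u32 left = sg_dma_len(sg);

                while (left) {
                        u32 offset = addr & 0xffff;
                        u32 chunk = min_t(u32, left, 0x10000 - offset);

                        addrs[nprd] = (u32)addr;  /* low 32 bits, for the sketch */
                        lens[nprd] = chunk;
                        nprd++;

                        addr += chunk;
                        left -= chunk;
                }
        }
        return nprd;    /* can exceed n_elem when segments straddle 64K */
}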

Cheers


Re: libata .sg_tablesize: why always dividing by 2 ?

2008-02-25 Thread Benjamin Herrenschmidt

On Mon, 2008-02-25 at 23:38 -0500, Mark Lord wrote:
 Benjamin Herrenschmidt wrote:
  James B.  suggests that we stick a WARN_ON() into libata to let us
  know if that precondition is violated.  Sounds like an easy thing to do
  for a couple of -rc cycles someday.
  
  If the block layer gives us a 32k block aligned on a 32k boundary
  (aligned), we have no guarantee that the iommu will not turn that into
  something unaligned crossing a 32k (and thus possibly a 64k) boundary.
 ..
 
 Certainly, but never any worse than what the block layer gave originally.
 
 The important note being:  IOMMU only ever *merges*, it never *splits*.

Yes, but it will also change the address and doesn't guarantee the
alignment.

 Which means that, by the time we split up any mis-merges again for 64K 
 crossings,
 we can never have more SG segments than what the block layer originally
 fed to the IOMMU stuff.

 Or so the IOMMU and SCSI experts here at LSF'08 have assured me,
 even after my own skeptical questioning.

I suppose so. I don't remember all of the details, but iirc, it has to
do with crossing 64K boundaries. Some controllers can't handle it.

It's not only the _size_ of the segments, it's their alignment.

The iommu will not keep alignment beyond the page size (and even
then... on powerpc with a 64k base page size, you may still end up with
a 4k aligned result, but let's not go there now).

So that means that even if your block layer gives you nicely aligned,
less-than-64k segments that don't cross 64k boundaries, and even if your
iommu isn't doing any merging at all, it may still give you back segments
that do not respect that 64k alignment, might cross a 64k boundary, and
thus might need to be split.

Now, it would make sense (if we don't have it already) to have a flag
provided by the host controller that tells us whether it suffers from
that limitation, and if not, we can avoid the whole thing.
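
Something like this, presumably (purely hypothetical: no such flag exists
in the tree, and the bit value is made up):

#include <linux/libata.h>

/* hypothetical: set by hosts whose DMA engine cannot cross 64K */
#define ATA_FLAG_NO_64K_CROSS   (1 << 30)

static inline int needs_64k_split(const struct ata_port *ap)
{
        /* hosts without the limitation could skip the split and the / 2 */
        return !!(ap->flags & ATA_FLAG_NO_64K_CROSS);
}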

Ben.




Re: libata .sg_tablesize: why always dividing by 2 ?

2008-02-25 Thread Mark Lord

Benjamin Herrenschmidt wrote:

On Mon, 2008-02-25 at 23:38 -0500, Mark Lord wrote:

Benjamin Herrenschmidt wrote:

James B.  suggests that we stick a WARN_ON() into libata to let us
know if that precondition is violated.  Sounds like an easy thing to do
for a couple of -rc cycles someday.

If the block layer gives us a 32k block aligned on a 32k boundary
(aligned), we have no guarantee that the iommu will not turn that into
something unaligned crossing a 32k (and thus possibly a 64k) boundary.

..

Certainly, but never any worse than what the block layer gave originally.

The important note being:  IOMMU only ever *merges*, it never *splits*.


Yes, but it will also change the address and doesn't guarantee the
alignment.


Which means that, by the time we split up any mis-merges again for 64K 
crossings,
we can never have more SG segments than what the block layer originally
fed to the IOMMU stuff.

Or so the IOMMU and SCSI experts here at LSF'08 have assured me,
even after my own skeptical questioning.


I suppose so. I don't remember all of the details, but iirc, it has to
do with crossing 64K boundaries. Some controllers can't handle it.

It's not only the _size_ of the segments, it's their alignment.

The iommu will not keep alignment beyond the page size (and even
then... on powerpc with a 64k base page size, you may still end up with
a 4k aligned result, but let's not go there now).

..

That's just not possible, unless the IOMMU *splits* segments.
And the IOMMU experts here say that it never does that.

-ml


Re: libata .sg_tablesize: why always dividing by 2 ?

2008-02-25 Thread Benjamin Herrenschmidt

On Tue, 2008-02-26 at 00:43 -0500, Mark Lord wrote:
  I suppose so. I don't remember all of the details, but iirc, it has to
  do with crossing 64K boundaries. Some controllers can't handle it.
  
  It's not only the _size_ of the segments, it's their alignment.
  
  The iommu will not keep alignment beyond the page size (and even
  then... on powerpc with a 64k base page size, you may still end up with
  a 4k aligned result, but let's not go there now).
 ..
 
 That's just not possible, unless the IOMMU *splits* segments.
 And the IOMMU experts here say that it never does that.

It is totally possible, and I know, as I wrote part of the powerpc iommu
code :-)

The iommu code makes no guarantee about preserving the alignment of a
segment, at least not below PAGE_SIZE.

Thus if you pass to dma_map_sg() a 64K aligned 64K segment, you may well
get back a 4K aligned 64K segment.

Enforcing natural alignment in the iommu code only happens for
dma_alloc_coherent (it uses order-N allocations anyway), it doesn't
happen for map_sg. If we were to do that, we would make it very likely
for iommu allocations to fail on machines with small DMA windows.
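
Concretely, a sketch (not code from the thread or any tree) of what a
driver can observe after dma_map_sg() behind such an iommu:

#include <linux/kernel.h>
#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/*
 * dma_map_sg() may merge entries (the returned count never grows), but
 * the alignment of sg_dma_address() is only guaranteed up to the iommu
 * page size, so a 64K-aligned input segment can come back merely
 * 4K-aligned and now straddle a 64K boundary.
 */
static int example_map(struct device *dev, struct scatterlist *sgl, int nents)
{
        struct scatterlist *sg;
        int i, n_mapped;

        n_mapped = dma_map_sg(dev, sgl, nents, DMA_TO_DEVICE);
        if (!n_mapped)
                return -EIO;

        for_each_sg(sgl, sg, n_mapped, i) {
                /* no longer 64K aligned?  the LLD may still have to split it */
                if (sg_dma_address(sg) & 0xffff)
                        pr_debug("seg %d only page-aligned after mapping\n", i);
        }
        return n_mapped;        /* n_mapped <= nents: merging only */
}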

Ben.
 

