Re: TTM merging?

2008-05-14 Thread Jerome Glisse
On Tue, 13 May 2008 21:35:16 +0100 (IST)
Dave Airlie [EMAIL PROTECTED] wrote:

 1) I feel there hasn't been enough open-driver coverage to prove it. So
 far we have done one Intel IGD; we have a lot of code that isn't required
 for these devices, so there is the question of how much code exists purely
 to support the Poulsbo closed-source userspace, and why we need to live
 with it. Both radeon and nouveau developers have expressed frustration
 about the fencing internals being really hard to work with, which doesn't
 bode well for maintainability in the future.

Well, my TTM experiments brought me up to EXA with radeon; I have also done
several small 3D tests to see how I want to send commands. So from my
experiments, here are the things that are becoming painful for me.

On some radeon hardware (most newer cards with a large amount of RAM) you
can't map VRAM beyond the aperture. Well, you can, but you need to reprogram
the card's aperture, and that's not something you want to do. TTM's
assumption is that memory accesses are done through a mapping of the buffer,
and in this situation that becomes cumbersome. We already discussed this,
and the idea was to split VRAM, but I don't like that solution. So in the
end I am more and more convinced that we should avoid mapping objects into
the client's VMA; I see two advantages to this: no TLB flushes on the VMA,
and no hard-to-solve page-mapping aliasing.

On the fence side, I hoped that I could have reasonable code using IRQs
working reliably, but after discussion with AMD, what I was doing was
obviously not recommended and prone to hard GPU lockups, which is a no-go
for me. The last solution I have in mind for synchronization (i.e. knowing
when the GPU is done with a buffer) could not use IRQs, at least not on all
the hardware I am interested in (r3xx/r4xx). Of course, I don't want to
busy-wait to know when the GPU is done. Also, the fence code puts too many
assumptions on what we should provide; while fencing might prove useful, I
think it is better served by driver-specific ioctls than by a common
infrastructure that the hardware obviously doesn't fit well, given the
differences between chips.

And like Stephane, I think GPU virtual memory can't be used to its best in
this scheme.

That said, I also share some concerns about GEM, like the high-memory page
issue, but I think this one is workable with help from the kernel people.
For VRAM, the solution discussed so far, and which I like, is to have the
driver choose, based on client requests, which objects to put there, and to
see VRAM as a cache. So we will have all objects backed by a RAM copy (which
can be swapped); then it's all a matter of syncing the VRAM copy and the RAM
copy when necessary. Domains and pread/pwrite access let you easily do this
sync only on the necessary area. Suspend also becomes easier: just sync the
objects whose write domain is the GPU. So all in all, I agree that GEM might
ask each driver to redo some stuff, but I think a large set of helper
functions can leverage this; more importantly, I see this as freedom for
each driver and the only way to cope with hardware differences.
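
A minimal sketch of that partial sync, assuming an i915-style pwrite ioctl
(the struct layout and wrapper are assumptions here; a radeon driver would
define its own):

#include <stdint.h>
#include <sys/ioctl.h>

/* Modeled on the i915 GEM pwrite argument struct; the exact layout is
 * an assumption, not an existing radeon interface. */
struct gem_pwrite {
        uint32_t handle;     /* GEM object handle */
        uint32_t pad;
        uint64_t offset;     /* byte offset into the object */
        uint64_t size;       /* number of bytes to write */
        uint64_t data_ptr;   /* user pointer to the source data */
};

/* Sync only the dirty byte range; the rest of the RAM backing copy and
 * its VRAM mirror are left untouched. */
static int sync_dirty_range(int fd, unsigned long pwrite_ioctl,
                            uint32_t handle, uint64_t offset,
                            const void *data, uint64_t size)
{
        struct gem_pwrite args = {
                .handle   = handle,
                .offset   = offset,
                .size     = size,
                .data_ptr = (uint64_t)(uintptr_t)data,
        };
        return ioctl(fd, pwrite_ioctl, &args);
}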

Cheers,
Jerome Glisse [EMAIL PROTECTED]



Re: TTM merging?

2008-05-14 Thread Thomas Hellström
Jerome Glisse wrote:
 [...]
Jerome, Dave, Keith

It's hard to argue against people trying things out and finding it's not 
really what they want, so I'm not going to do that.

The biggest argument (apart from the fencing) seems to be that people think
TTM stops them from doing what they want with the hardware, although the
Nouveau needs and the Intel UMA needs seem to be quite opposite. In an
open-source community where people work on things because they want to, not
being able to do what you want is a bad thing.

OTOH, a stall and disagreement about what's the best thing to use is even
worse. It confuses the users, and it's particularly bad for people trying to
write drivers on a commercial basis.

I've looked through KeithP's mail to look for a way to use GEM for future
development. Since many things will be device-dependent, I think it's
possible for us to work around some of the issues I see, but a couple of big
things remain.

1) The inability to map device memory. The design arguments and proposed
solution for VRAM are not really valid. Think of this, probably not too
uncommon, scenario: a single-pixel fallback composite to a scanout buffer in
VRAM, or a texture or video frame upload:

A) Page in all GEM pages, because they've been paged out.
B) Copy the complete scanout buffer to GEM because it's dirty. Untile.
C) Write the pixel.
D) Copy the complete buffer back while tiling.

2) Reserving pages when allocating VRAM buffers is also a very bad solution,
particularly on systems with a lot of VRAM and little system RAM
(multiple-card machines?). GEM basically needs to reserve swap space when
buffers are created, and put a limit on the pinned physical pages. We
basically should not be able to fail memory allocation during execbuf,
because we cannot recover from that.

Other things like GFP_HIGHUSER etc. are probably fixable if there is a will
to do it.

So if GEM is the future, these shortcomings must IMHO be addressed. In
particular, GEM should not stop people from mapping device memory directly.
Particularly not in view of the arguments against TTM previously outlined.

Re: TTM merging?

2008-05-14 Thread Keith Whitwell

 I do worry that TTM is not Linux enough, it seems you have decided that we 
 can never do in-kernel allocations at any useable speed and punted the 
 work into userspace, which makes life easier for Gallium as its more like 
 what Windows does, but I'm not sure this is a good solution for Linux.
 

I have no idea where this set of ideas comes from, and it's a little
disturbing to me.

On a couple of levels, it's clearly bogus.  

Firstly, TTM and its libdrm interfaces predate Gallium by years.

Secondly, the Windows work we've done with Gallium to date has been on XP
and _entirely_ in kernel space, so the whole issue of user/kernel allocation
strategies never came up.

Thirdly, Gallium's backend interfaces are all about abstracting away from
the OS, so that drivers can be picked up and dumped down in multiple places.
It's ludicrous to suggest that the act of abstracting away from TTM has in
itself skewed TTM -- the point is that the driver has been made independent
of TTM. The point of Gallium is that it should work on top of *anything*; if
we had had to skew TTM in some way to achieve that, then we would have
already failed right at the starting point...

Lastly, and most importantly, I believe that using TTM kernel allocations to 
back a user space sub-allocator *is the right strategy*.  

This has nothing to do with Gallium. No matter how fast you make a kernel
allocator (and I applaud efforts to make it fast), it is always going to be
quicker to do allocations locally. This is the reason we have malloc() and
not just mmap() or brk()/sbrk().

Also, sub-allocation doesn't imply massive preallocation. That bug is well
fixed by Thomas's user-space slab allocator code.

Keith




Re: TTM merging?

2008-05-14 Thread Ben Skeggs
  [...]
 
  OK. So basically what I'm asking is: when we have full-featured
  open-source drivers available that utilize TTM, either as part of the DRM
  core or, if needed, as part of driver-specific code, do you see anything
  else that prevents that from being pushed? That would be very valuable to
  know for anyone starting porting work.
 
 I was hoping that by now one of the radeon or nouveau drivers would have
 adopted TTM, or at least demoed something working using it; this hasn't
 happened, which worries me. Perhaps glisse or darktama could fill in on
 what limited them from doing it. The fencing internals are very, very
 scary and seem to be a major stumbling block.
The fencing internals do seem overly complicated indeed, but that's
something that I'm personally OK with taking the time to figure out how
to get right.  Is there any good documentation around that describes it
in detail?

I actually started working on nouveau/ttm again a month or so back, with the
intention of actually having the work land this time. Overall, I don't have
much of a problem with TTM and would be willing to work with it. Supporting
G8x/G9x chips was the reason the work stalled again; I wasn't sure at the
time what requirements we'd have from a memory manager.

The issue on G8x is that the 3D engine will refuse to render to linear
surfaces, and in order to set up tiling we need to make use of a channel's
page tables. The driver doesn't get any control when VRAM is allocated, so
that it could set up the page tables appropriately, etc. I just had a
thought that the driver-specific validation ioctl could probably handle that
at the last minute, so perhaps that's also not an issue. I'll look more into
G8x/ttm after I finish my current G8x work.

Another minor issue (which probably doesn't affect merging?): Nouveau makes
extensive use of fence classes; we assign one fence class to each GPU
channel (read: context + command submission mechanism). We have 128 of these
on G80 cards, and the current _DRM_FENCE_CLASSES of 8 is insufficient even
for NV1x hardware.

So overall, I'm basically fine with TTM now that I've actually made a proper
attempt at using it. GEM does seem interesting; I'll also follow its
development while I continue with other non-mm G80 work.

Cheers,
Ben.
 
 I do worry that TTM is not Linux enough; it seems you have decided that we
 can never do in-kernel allocations at any usable speed and punted the work
 into userspace, which makes life easier for Gallium as it's more like what
 Windows does, but I'm not sure this is a good solution for Linux.
 
 The real question is whether TTM suits the driver writers for use in Linux
 desktop and embedded environments, and I think so far I'm not seeing
 enough positive feedback from the desktop side.
 
 Also, wrt the i915 driver: it has too many experiments in it. The i915
 users need to group together, remove the code paths that make no sense,
 and come up with a suitable userspace driver for it, removing all unused
 fencing mechanisms etc.
 
 Dave.
 
   
 




Re: TTM merging?

2008-05-14 Thread Jerome Glisse
On Wed, 14 May 2008 12:09:06 +0200
Thomas Hellström [EMAIL PROTECTED] wrote:

 Jerome, Dave, Keith
 
 
 1) The inability to map device memory. The design arguments and proposed
 solution for VRAM are not really valid. Think of this, probably not too
 uncommon, scenario: a single-pixel fallback composite to a scanout buffer
 in VRAM, or a texture or video frame upload:

 A) Page in all GEM pages, because they've been paged out.
 B) Copy the complete scanout buffer to GEM because it's dirty. Untile.
 C) Write the pixel.
 D) Copy the complete buffer back while tiling.

With pwrite/pread you give the offset and size of the area you are
interested in. So for the single-pixel case it will pread a page and pwrite
it once the fallback has finished. I totally agree that downloading the
whole object on a fallback is to be avoided. But as long as we don't have a
fallback which draws the whole screen, we are fine; and since such a
fallback will be disastrous whether we map VRAM or not, I am led to discard
this drawback and just accept the pain for such fallbacks.

Also, I am confident that we can find a more clever way in such cases, like
doing the whole rendering in RAM and updating the final result, i.e.
assuming that the up-to-date copy is in RAM and that VRAM might be out of
sync.
 
 2) Reserving pages when allocating VRAM buffers is also a very bad
 solution, particularly on systems with a lot of VRAM and little system RAM
 (multiple-card machines?). GEM basically needs to reserve swap space when
 buffers are created, and put a limit on the pinned physical pages. We
 basically should not be able to fail memory allocation during execbuf,
 because we cannot recover from that.

Well, this solves the suspend problem we were discussing at XDS, i.e. what
to do with buffers. If we know that we have room to put the buffers, then we
don't need to worry about which buffers we are ready to lose. Given that
OpenGL doesn't give any clue about that, this sounds like a good approach.

For embedded devices, where every piece of RAM still matters, I guess you
also have to deal with the suspend case, so you have a way to either save
VRAM content or preserve it. I don't see any problem with GEM coping with
this case either.

 Other things like GFP_HIGHUSER etc. are probably fixable if there is a
 will to do it.

 So if GEM is the future, these shortcomings must IMHO be addressed. In
 particular, GEM should not stop people from mapping device memory
 directly. Particularly not in view of the arguments against TTM previously
 outlined.

As I said, I have come to the opinion that not mapping VRAM into a userspace
VMA sounds like a good plan. I am even thinking that avoiding all mappings
and encouraging pread/pwrite is a better solution. For me, VRAM is temporary
storage that card makers use to speed up their hardware, and so it should
not be directly used by userspace. Note that this does not go against having
user space choose the policy for VRAM usage, i.e. which object to put where.

Cheers,
Jerome Glisse



Re: TTM merging?

2008-05-14 Thread Thomas Hellström
Jerome Glisse wrote:
 On Wed, 14 May 2008 12:09:06 +0200
 Thomas Hellström [EMAIL PROTECTED] wrote:

   
 [...]

 With pwrite/pread you give the offset and size of the area you are
 interested in. So for the single-pixel case it will pread a page and
 pwrite it once the fallback has finished. I totally agree that downloading
 the whole object on a fallback is to be avoided. But as long as we don't
 have a fallback which draws the whole screen, we are fine; and since such
 a fallback will be disastrous whether we map VRAM or not, I am led to
 discard this drawback and just accept the pain for such fallbacks.

   
I don't agree with you here. EXA is much faster for small composite
operations, and even for small fill blits, if fallbacks are used -- even to
write-combined memory, though that of course depends on the hardware. This
is going to be even more pronounced with acceleration architectures like
Glucose and similar that don't have an optimized path for small hardware
composite operations.

My personal feeling is that pwrites are a workaround for a workaround for a
very bad decision:

To avoid user-space allocators on device-mapped memory. This led to a hack
to avoid caching-policy changes, which led to cache-thrashing problems,
which put us in the current situation. How far are we going to follow this
path before people wake up? What's wrong with the performance of good old
i915tex, which even beats classic i915 in many cases?

Having to go through potentially (and even probably) paged-out memory to
access buffers that are present in VRAM sounds like a very odd approach (to
say the least) to me. Even if it's a single page, implementing per-page
dirty checks for domain flushing isn't very appealing either.

 Also, I am confident that we can find a more clever way in such cases,
 like doing the whole rendering in RAM and updating the final result, i.e.
 assuming that the up-to-date copy is in RAM and that VRAM might be out of
 sync.
   
Why should we have to when we can do it right?
  
   
 [...]

 Well, this solves the suspend problem we were discussing at XDS, i.e. what
 to do with buffers. If we know that we have room to put the buffers, then
 we don't need to worry about which buffers we are ready to lose. Given
 that OpenGL doesn't give any clue about that, this sounds like a good
 approach.

 For embedded devices, where every piece of RAM still matters, I guess you
 also have to deal with the suspend case, so you have a way to either save
 VRAM content or preserve it. I don't see any problem with GEM coping with
 this case either.
   
No. GEM can't cope with it. Let's say you have a 512M system with two 1G
video cards and 4G of swap space, and you want to fill both cards' video RAM
with render-and-forget textures for whatever purpose.

What happens? After you've generated the first, say, 300M, the system
mysteriously starts to page, and when, after a couple of minutes of crawling
texture-upload speeds, you're done, the system is using and has written
almost 2G of swap. Now you want to update the textures and expect fast
texsubimage...

So having a backing object that you have to access to get things into VRAM
is not the way to go. The correct way to do this is to reserve, but not use,
swap space. Then you can start using it on suspend, provided that the
swapping system is still up (which it has to be with the current GEM
approach anyway). If pwrite is used in this case, it must not dirty any
backing-object pages.

/Thomas





   

Re: TTM merging?

2008-05-14 Thread Thomas Hellström
Ben Skeggs wrote:
 [...]
 
 The fencing internals do seem overly complicated indeed, but that's
 something that I'm personally OK with taking the time to figure out how
 to get right.  Is there any good documentation around that describes it
 in detail?
   
Yes, there is a wiki page.
http://dri.freedesktop.org/wiki/TTMFencing
 [...]

 Another minor issue (which probably doesn't affect merging?): Nouveau
 makes extensive use of fence classes; we assign one fence class to each
 GPU channel (read: context + command submission mechanism). We have 128 of
 these on G80 cards, and the current _DRM_FENCE_CLASSES of 8 is
 insufficient even for NV1x hardware.
   
Ouch. Yes, it should be OK to bump that, as long as kmalloc doesn't
complain.
 So overall, I'm basically fine with TTM now that I've actually made a
 proper attempt at using it. GEM does seem interesting; I'll also follow
 its development while I continue with other non-mm G80 work.

 Cheers,
 Ben.
   
Nice to know, Ben. Anyway, whatever happens, the fencing code will remain
for some drivers, either device-specific or common, so if you find ways to
simplify it or things that don't look right, please let me know.

/Thomas






Re: TTM merging?

2008-05-14 Thread Keith Packard
On Wed, 2008-05-14 at 12:09 +0200, Thomas Hellström wrote:

 1) The inability to map device memory. The design arguments and proposed
 solution for VRAM are not really valid. Think of this, probably not too
 uncommon, scenario: a single-pixel fallback composite to a scanout buffer
 in VRAM, or a texture or video frame upload:

Nothing prevents you from mapping device memory; it's just that on a UMA
device there's no difference, and there are some significant advantages to
using the direct mapping. I wrote the API I needed for my device; I think
it's simple enough that other devices can add the APIs they need.

But what we've learned in the last few months is that mapping *any* pages
into user space is a last-resort mechanism. Mapping pages WC or UC requires
inter-processor interrupts, and using normal WB pages means invoking clflush
on regions written from user space.

The glxgears benchmark demonstrates this with some clarity -- using
pwrite to send batch buffers is nearly three times faster (888 fps using
pwrite vs 300 fps using mmap) than mapping pages to user space and then
clflush'ing them in the kernel.
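
In rough code, the two submission paths compare as follows; gem_mmap() and
gem_pwrite() stand in for the actual ioctl wrappers, which were still
settling at the time, so treat the names and signatures as assumptions:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wrappers around the GEM mmap and pwrite ioctls. */
extern void *gem_mmap(int fd, uint32_t handle, size_t size);
extern int gem_pwrite(int fd, uint32_t handle, uint64_t offset,
                      const void *data, uint64_t size);

/* Path A: map the object and write through the mapping.  The kernel
 * must later clflush the dirty cachelines (or the mapping must be
 * WC/UC, which costs inter-processor interrupts to set up). */
static void submit_mapped(int fd, uint32_t handle,
                          const uint32_t *cmds, size_t bytes)
{
        void *map = gem_mmap(fd, handle, bytes);
        memcpy(map, cmds, bytes);
}

/* Path B: pwrite -- one copy inside the kernel, no user mapping, no
 * TLB shootdown, no clflush of a user VMA.  This is the path measured
 * at roughly 3x faster above. */
static int submit_pwrite(int fd, uint32_t handle,
                         const uint32_t *cmds, size_t bytes)
{
        return gem_pwrite(fd, handle, 0, cmds, bytes);
}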

 A) Page in all GEM pages, because they've been paged out.
 B) Copy the complete scanout buffer to GEM because it's dirty. Untile.
 C) Write the pixel.
 D) Copy the complete buffer back while tiling.

First off, I don't care about fallbacks; any driver using fallbacks is
broken.

Second, if you had to care about fallbacks on non-UMA hardware, you'd
compute the pages necessary for the fallback and only map/copy those
anyway.

 2) Reserving pages when allocating VRAM buffers is also a very bad
 solution, particularly on systems with a lot of VRAM and little system RAM
 (multiple-card machines?). GEM basically needs to reserve swap space when
 buffers are created, and put a limit on the pinned physical pages. We
 basically should not be able to fail memory allocation during execbuf,
 because we cannot recover from that.

As far as I know, any device using VRAM will not save it across
suspend/resume. From my perspective, this means you don't get a choice about
allocating backing store for that data.

Because GEM has backing store, we can limit pinned memory to only those
pages needed for the current operation, waiting to pin pages until the
device is ready to execute the operation. As I said in my earlier email,
that part of the kernel driver is not written yet. I was hoping to get that
finished before launching into this discussion, as it is always better to
argue with running code.

 This means that the dependency on SHMEMFS probably needs to be dropped and
 replaced with some sort of DRMFS that allows overloading of mmap and
 correct swap handling, addresses the caching issue, and also avoids the
 driver do_mmap().

Because GEM doesn't expose the use of shmfs to the user, there's no
requirement that all objects use this abstraction. You could even have
multiple object-creation functions if that made sense in your driver.

-- 
[EMAIL PROTECTED]




Re: TTM merging?

2008-05-14 Thread Jerome Glisse
On Wed, 14 May 2008 16:36:54 +0200
Thomas Hellström [EMAIL PROTECTED] wrote:

 Jerome Glisse wrote:
 I don't agree with you here. EXA is much faster for small composite
 operations, and even for small fill blits, if fallbacks are used. [...]

 Having to go through potentially (and even probably) paged-out memory to
 access buffers that are present in VRAM sounds like a very odd approach
 (to say the least) to me. Even if it's a single page, implementing
 per-page dirty checks for domain flushing isn't very appealing either.

I don't have numbers or benchmarks to check how fast the pread/pwrite path
might be in this use, so I am just expressing my feeling, which happens to
be: avoid VMA TLB flushes as much as we can. I get the feeling that the
kernel goes through numerous tricks to avoid TLB flushing for a good reason,
and I am also pretty sure that, with the number of cores growing, anything
that needs CPU-wide synchronization is to be avoided.

Hopefully, once I get a decent amount of time to do benchmarks with GEM, I
will check out my theory. I think a simple benchmark can be done on Intel
hardware: just return FALSE in the EXA PrepareAccess hook to force the use
of DownloadFromScreen, and in DownloadFromScreen use pread; then comparing
benchmarks of this hacked Intel DDX against a normal one should already give
some numbers.
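
Sketched as EXA driver hooks, the hack would look roughly like this; the
hook signatures follow exa.h, while gem_pread(), gem_handle_of() and
gem_offset_of() are hypothetical helpers, not an existing API:

#include <stddef.h>
#include <stdint.h>
#include "exa.h"   /* X server EXA driver interface: Bool, PixmapPtr, ... */

extern int gem_pread(uint32_t handle, uint64_t offset, void *dst,
                     size_t size);
extern uint32_t gem_handle_of(PixmapPtr pix);
extern uint64_t gem_offset_of(PixmapPtr pix, int x, int y);

/* Refuse direct CPU access so every fallback goes through
 * DownloadFromScreen instead of a mapping. */
static Bool
HackedPrepareAccess(PixmapPtr pPix, int index)
{
        return FALSE;
}

/* Read back only the requested rectangle with pread, line by line,
 * instead of mapping the whole buffer. */
static Bool
HackedDownloadFromScreen(PixmapPtr pSrc, int x, int y, int w, int h,
                         char *dst, int dst_pitch)
{
        int cpp = pSrc->drawable.bitsPerPixel / 8;
        int line;

        for (line = 0; line < h; line++) {
                if (gem_pread(gem_handle_of(pSrc),
                              gem_offset_of(pSrc, x, y + line),
                              dst + line * dst_pitch, w * cpp) != 0)
                        return FALSE;
        }
        return TRUE;
}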

 Why should we have to when we can do it right?

Well, my point was that mapping VRAM is not right; I am not saying that I
know the truth. It's just a feeling based on my experiments with TTM, on the
BAR restriction stuff, and on other considerations of the same kind.

 No. GEM can't cope with it. Let's say you have a 512M system with two 1G
 video cards and 4G of swap space, and you want to fill both cards' video
 RAM with render-and-forget textures for whatever purpose. [...]

 So having a backing object that you have to access to get things into
 VRAM is not the way to go. The correct way to do this is to reserve, but
 not use, swap space. [...] If pwrite is used in this case, it must not
 dirty any backing-object pages.
 

For a normal desktop I don't expect the VRAM amount to exceed the RAM
amount; people with 1G of VRAM are usually hardcore gamers with 4G of RAM
:). Also, most objects in the 3D world are stored in memory; if programs are
not stupid and trust GL to keep their textures, then you just have the usual
RAM copy and possibly a VRAM copy, so I don't see any waste in the normal
use case. Of course we can always come up with crazy, weird setups, but I am
more interested in dealing well with average Joe than in dealing mostly well
with every use case.

That said, I do see GPGPU as a possible user of big temporary VRAM buffers,
i.e. buffers you can throw away. For that kind of thing it does make sense
not to have a backing RAM/swap area. But I would rather add something to GEM
like intercepting the allocation of such buffers and not creating the
backing buffer, or add a driver-specific ioctl for that case.

Anyway, I think we need benchmarks to know which option is really best in
the end. I don't have code to support my general feeling, so I might be
wrong. Sadly, we don't have 2^32 monkeys coding day and night on the DRM to
test all the solutions :)

Cheers,
Jerome Glisse [EMAIL PROTECTED]



Re: TTM merging?

2008-05-14 Thread Keith Whitwell


- Original Message 
 From: Jerome Glisse [EMAIL PROTECTED]
 To: Thomas Hellström [EMAIL PROTECTED]
 Cc: Dave Airlie [EMAIL PROTECTED]; Keith Packard [EMAIL PROTECTED]; DRI 
 dri-devel@lists.sourceforge.net; Dave Airlie [EMAIL PROTECTED]
 Sent: Wednesday, May 14, 2008 6:08:55 PM
 Subject: Re: TTM merging?
 
 On Wed, 14 May 2008 16:36:54 +0200
 Thomas Hellström wrote:
 
  [...]
  For a normal desktop I don't expect the VRAM amount to exceed the RAM
  amount; people with 1G of VRAM are usually hardcore gamers with 4G of RAM
  :). Also, most objects in the 3D world are stored in memory; if programs
  are not stupid and trust GL to keep their textures, then you just have
  the usual RAM copy and possibly a VRAM copy, so I don't see any waste in
  the normal use case. Of course we can always come up with crazy, weird
  setups, but I am more interested in dealing well with average Joe than in
  dealing mostly well with every use case.

It's always been a big win to go to single-copy texturing. Textures tend to
be large, and nobody has so much memory that doubling up on textures has
ever been appealing... And there are obvious use cases like textured video
where only having a single copy is a big performance win.

It certainly makes things easier for the driver to duplicate textures --
which is why all the old DRI drivers did it -- but it doesn't make it
right... And the old DRI drivers also copped out on things like
render-to-texture, etc., so whatever gains you make in simplicity by
treating VRAM as a cache, some of those will be lost because you'll have to
keep track of which of the two copies of a texture is up to date, and you'll
still have to preserve (modified) texture contents on eviction, which old
DRI never had to.

Ultimately it boils down to a choice between making your life easier as a
developer of the driver and producing a driver that makes the most of all
the system resources.

Nobody can force you to take one path or the other, but it's certainly my
intention, when considering drivers for VRAM hardware, to support
single-copy-number textures, and for that reason I'd be unhappy to see a
system adopted that prevented that.

Keith

Re: TTM merging?

2008-05-14 Thread Jerome Glisse
On Wed, 14 May 2008 10:21:15 -0700 (PDT)
Keith Whitwell [EMAIL PROTECTED] wrote:

 [...]

Re: TTM merging?

2008-05-14 Thread Eric Anholt
On Wed, 2008-05-14 at 10:21 -0700, Keith Whitwell wrote:
 
 [...]
 It's always been a big win to go to single-copy texturing. Textures tend
 to be large, and nobody has so much memory that doubling up on textures
 has ever been appealing... And there are obvious use cases like textured
 video where only having a single copy is a big performance win.

So upload it with pwrite. Have your driver's implementation of pwrite make
some VRAM space, copy the data directly in, and mark it as needing to be
synced to backing store if evicted. You haven't even loaded the pages of the
backing store in, so you haven't allocated that memory. I'm not a big fan of
this because it seems to leave nasty problems with scaring up enough memory
when you go to suspend/evict, but I'm not the person writing your driver, so
it's not my decision.
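
A rough kernel-side sketch of that approach; every name here is
hypothetical, and a real driver would stream large writes instead of
bouncing the whole request through one kmalloc:

#include <linux/io.h>
#include <linux/slab.h>
#include <linux/types.h>
#include <linux/uaccess.h>

struct vram_object {
        void __iomem *vram;   /* mapping of the object's VRAM pages */
        bool dirty;           /* must sync to backing store on evict */
};

static int vram_pwrite(struct vram_object *obj, loff_t offset,
                       const void __user *data, size_t size)
{
        void *tmp = kmalloc(size, GFP_KERNEL);

        if (!tmp)
                return -ENOMEM;
        if (copy_from_user(tmp, data, size)) {
                kfree(tmp);
                return -EFAULT;
        }
        /* Copy straight into VRAM; the shmem backing pages are never
         * faulted in, so no system RAM is spent on them yet. */
        memcpy_toio(obj->vram + offset, tmp, size);
        obj->dirty = true;
        kfree(tmp);
        return 0;
}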


Re: TTM merging?

2008-05-14 Thread Eric Anholt
On Wed, 2008-05-14 at 16:36 +0200, Thomas Hellström wrote:
 [...]
 No. GEM can't cope with it. Let's say you have a 512M system with two 1G
 video cards and 4G of swap space, and you want to fill both cards' video
 RAM with render-and-forget textures for whatever purpose.

Who's selling that system?  Who's building that system at home?

-- 
Eric Anholt [EMAIL PROTECTED]





Re: TTM merging?

2008-05-14 Thread Eric Anholt
On Wed, 2008-05-14 at 02:33 +0200, Thomas Hellström wrote:
  The real question is whether TTM suits the driver writers for use in Linux 
  desktop and embedded environments, and I think so far I'm not seeing 
  enough positive feedback from the desktop side.

 I actually haven't seen much feedback at all, at least not on the mailing
 lists. Anyway, we need to look at the alternatives, which currently means
 GEM.

 GEM, while still in development, basically brings us back to the
 functionality of TTM 0.1, with added paging support but without
 fine-grained locking and caching-policy support.
 
 I might have misunderstood things, but quickly browsing the code raises
 some obvious questions:

 1) Some AGP chipsets don't support page addresses above 32 bits. GEM
 objects use GFP_HIGHUSER, and it's hardcoded into the Linux swap code.

The obvious solution here is what many DMA APIs do for IOMMUs that can't
address all of memory -- keep a pool of pages within the addressable range
and bounce data through them. I think the Linux kernel even has interfaces
to support us in this. Since it's not going to be a very common case, we may
not care about the performance. If we do find that we care about the
performance, we should first attempt to get what we need into the Linux
kernel so we don't have to duplicate code, and only if that fails do the
duplication.
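
A minimal sketch of that bounce idea for a GART limited to 32-bit addresses;
GFP_DMA32 and copy_highpage() are real kernel interfaces, but the helper
itself is hypothetical:

#include <linux/gfp.h>
#include <linux/highmem.h>
#include <linux/io.h>

/* Return a page the GART can address, staging the data through a
 * below-4GB bounce page when the original sits too high. */
static struct page *gart_addressable_page(struct page *orig)
{
        struct page *bounce;

        if (page_to_phys(orig) < (1ULL << 32))
                return orig;                 /* already addressable */

        bounce = alloc_page(GFP_KERNEL | GFP_DMA32);
        if (bounce)
                copy_highpage(bounce, orig); /* dst, src */
        return bounce;
}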

I'm pretty sure the AGP-chipsets-versus->32-bit-pages danger has been
overstated, though. Besides the fact that you needed to load one of these
older machines with a full 4GB of memory (well, theoretically 3.5GB, but how
often can you even boot a system with a 2, 1, .5GB combo?), you also need a
chipset that does >32-bit addressing.

At least all the AMD and Intel chipsets don't appear to have this problem in
the survey I did last night, as they've either got a >32-bit chipset and a
>32-bit GART, or a 32-bit chipset and a 32-bit GART. Basically all I'm
worried about is ATI PCI[E]GART at this point.

http://dri.freedesktop.org/wiki/GARTAddressingLimits

snip bits that have been covered in other mails

 5) What's protecting i915 GEM object privates and lists in a 
 multi-threaded environment?

Nothing at the moment. That's my current project. dev->struct_mutex is the
plan -- I don't want to see finer-grained locking until we show that
contention on that lock is an issue. Fine-grained locking takes significant
care, and there are a lot more important performance improvements to work on
before then.
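
In sketch form (the field and list names are illustrative, not the real
i915 layout), the coarse scheme is simply:

#include <linux/list.h>
#include <linux/mutex.h>

struct sketch_device {
        struct mutex struct_mutex;     /* the one big per-device lock */
        struct list_head active_list;  /* objects busy on the GPU */
};

struct sketch_object {
        struct list_head list;
};

/* Every list/private manipulation takes the same device mutex; no
 * finer granularity until contention is actually measured. */
static void sketch_mark_active(struct sketch_device *dev,
                               struct sketch_object *obj)
{
        mutex_lock(&dev->struct_mutex);
        list_add_tail(&obj->list, &dev->active_list);
        mutex_unlock(&dev->struct_mutex);
}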

 6) Isn't do_mmap() strictly forbidden in new drivers? I remember seeing
 some severe ranting about it on the LKML.

We've talked it over with Arjan, and until we can use real fds as our
handles to objects, he thought it sounded OK. But apparently Al Viro is
working on making it feasible for us to allocate a thousand fds. At that
point, the mmap/pread/pwrite/close ioctls could be replaced with the
syscalls they were named for, and the kernel guys will love us.

 TTM is designed to cope with most hardware quirks I've come across with 
 different chipsets so far, including Intel UMA, Unichrome, Poulsbo, and 
 some other ones. GEM basically leaves it up to the driver writer to 
 reinvent the wheel..

The problem with TTM is that it's designed to expose one general API for all
hardware, when that's not what our drivers want. The GPU-GPU cache handling
for Intel, for example, mapped the hardware so poorly that every batch just
flushed everything. Bolting on the clflush-based CPU-GPU caching management
for our platform recovered a lot of performance, but we're still having to
reuse buffers in userland, at a memory cost, because allocating buffers is
overly expensive with the general supporting-everybody (but oops, it's not
swappable!) object allocator.

We're trying to come at it from the other direction: implement one driver
well. When someone else implements another driver and finds that there's
code that should be common, make it into a support library and share it.

I actually would have liked the whole interface to userland to be
driver-specific, with a support library for the parts we think other people
would want, but DRI2 wants to use buffer objects for its shared-memory
transport and I didn't want to rock its boat too hard, so the ioctls that
should be supportable for everyone got moved to generic code.

If the implementation of those ioctls in generic code doesn't work for
some drivers (say, early shmfs object creation turns out to be a bad
idea for VRAM drivers), I'll happily push it out to the driver.

-- 
Eric Anholt [EMAIL PROTECTED]

Re: TTM merging?

2008-05-14 Thread Keith Packard
On Wed, 2008-05-14 at 16:36 +0200, Thomas Hellström wrote:

 My personal feeling is that pwrites are a workaround for a workaround 
 for a very bad decision

Feel free to map VRAM then if you can; I didn't need to on Intel as
there isn't any difference.

-- 
[EMAIL PROTECTED]




Re: TTM merging?

2008-05-14 Thread Keith Packard
On Wed, 2008-05-14 at 19:08 +0200, Jerome Glisse wrote:

 I don't have numbers or benchmarks to check how fast the pread/pwrite path
 might be in this use, so I am just expressing my feeling, which happens to
 be that we should avoid VMA TLB flushes as much as we can.

For batch buffers, pwrite is 3X faster than map/write/unmap, at least as
measured by that most estimable benchmark 'glxgears'. Take that with as
much skepticism as it deserves.
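
For reference, the two paths being compared, sketched in userland C
(the pwrite struct mirrors a pread-style one and the ioctl request
code is a placeholder, not a real interface):

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

#define HYPOTHETICAL_IOCTL_GEM_PWRITE 0  /* placeholder request code */

struct gem_pwrite {
        uint32_t handle;
        uint64_t offset;
        uint64_t size;
        uint64_t data_ptr;
};

/* Path A: map/write/unmap -- pays for VMA setup and a TLB flush on
 * every batch. */
static void upload_mapped(int obj_fd, const void *batch, size_t len)
{
        void *ptr = mmap(NULL, len, PROT_WRITE, MAP_SHARED, obj_fd, 0);

        memcpy(ptr, batch, len);        /* error handling omitted */
        munmap(ptr, len);
}

/* Path B: pwrite-style -- one ioctl, the kernel does the copy, no
 * VMA churn at all. */
static void upload_pwrite(int drm_fd, uint32_t handle,
                          const void *batch, size_t len)
{
        struct gem_pwrite args = {
                .handle   = handle,
                .offset   = 0,
                .size     = len,
                .data_ptr = (uintptr_t)batch,
        };

        ioctl(drm_fd, HYPOTHETICAL_IOCTL_GEM_PWRITE, &args);
}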

-- 
[EMAIL PROTECTED]




Re: TTM merging?

2008-05-14 Thread Keith Packard
On Wed, 2008-05-14 at 10:21 -0700, Keith Whitwell wrote:

 Nobody can force you to take one path or the other, but it's certainly
 my intention when considering drivers for VRAM hardware to support
 single-copy-number textures, and for that reason, I'd be unhappy to
 see a system adopted that prevented that.

And, GEM on UMA does single-copy texture updates, just as TTM does.
From an object management perspective, GEM isn't very different from
TTM; it's just that the current code is written for UMA, and no one has
shown code for either of these running on non-UMA hardware.

-- 
[EMAIL PROTECTED]




Re: TTM merging?

2008-05-14 Thread Thomas Hellström
Eric Anholt wrote:
 On Wed, 2008-05-14 at 02:33 +0200, Thomas Hellström wrote:
   
 The real question is whether TTM suits the driver writers for use in Linux 
 desktop and embedded environments, and I think so far I'm not seeing 
 enough positive feedback from the desktop side.
   
   
 I actually haven't seen much feedback at all. At least not on the
 mailing lists.
 Anyway we need to look at the alternatives, which currently means GEM.

 GEM, while still in development, basically brings us back to the
 functionality of TTM 0.1, with added paging support but without
 fine-grained locking and caching policy support.

 I might have misunderstood things, but quickly browsing the code raises
 some obvious questions:

 1) Some AGP chipsets don't support page addresses > 32 bits. GEM objects
 use GFP_HIGHUSER, and it's hardcoded into the Linux swap code.
 

 The obvious solution here is what many DMA APIs do for IOMMUs that can't
 address all of memory -- keep a pool of pages within the addressable
 range and bounce data through them.  I think the Linux kernel even has
 interfaces to support us in this.  Since it's not going to be a very
 common case, we may not care about the performance.  If we do find that
 we care about the performance, we should first attempt to get what we
 need into the linux kernel so we don't have to duplicate code, and only
 if that fails do the duplication.
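
 The bounce-page idea in miniature -- a sketch; GFP_DMA32 and
 copy_highpage() are real kernel interfaces, the helper itself is
 illustrative:

 #include <linux/gfp.h>
 #include <linux/highmem.h>

 /* Give a 32-bit-limited GART something it can reach: allocate a page
  * below 4GB and mirror the high page's contents into it. */
 static struct page *bounce_page_for_gart(struct page *high)
 {
         struct page *low = alloc_page(GFP_KERNEL | GFP_DMA32);

         if (!low)
                 return NULL;
         copy_highpage(low, high);  /* data now addressable by the GART */
         return low;
 }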

 I'm pretty sure the AGP chipsets versus 32-bits pages danger has been
 overstated, though.  Besides the fact that you needed to load one of
 these older supposed machines with a full 4GB of memory (well,
 theoretically 3.5GB but how often can you even boot a system with a 2,
 1, .5gb combo?), you also need a chipset that does >32-bit addressing.

 At least all AMD and Intel chipsets don't appear to have this problem in
 the survey I did last night, as they've either got a chipset and GART that
 both address more than 32 bits, or a 32-bit chipset and a 32-bit GART.
 Basically all I'm worried about is ATI PCI[E]GART at this point.

 http://dri.freedesktop.org/wiki/GARTAddressingLimits

 [snip bits that have been covered in other mails]

   
There will probably turn up a couple more devices or incomplete
drivers, but in the long run this is a fixable problem.
 5) What's protecting i915 GEM object privates and lists in a 
 multi-threaded environment?
 

 Nothing at the moment.  That's my current project.  dev->struct_mutex is
 the plan -- I don't want to see finer-grained locking until we show that
 contention on that lock is an issue.  Fine-grained locking takes
 significant care, and there are more important performance
 improvements to work on before then.

   
 6) Isn't do_mmap() strictly forbidden in new drivers? I remember seeing 
 some severe ranting about it on the lkml?
 

 We've talked it over with Arjan, and until we can use real fds as our
 handles to objects, he thought it sounded OK.  But apparently Al Viro's
 working on making it so that allocating a thousand fds would be feasible
 for us.  At that point the mmap/pread/pwrite/close ioctls could be replaced
 with the syscalls they were named for, and the kernel guys will love us.

   
 TTM is designed to cope with most hardware quirks I've come across with 
 different chipsets so far, including Intel UMA, Unichrome, Poulsbo, and 
 some other ones. GEM basically leaves it up to the driver writer to 
 reinvent the wheel..
 

 The problem with TTM is that it's designed to expose one general API for
 all hardware, when that's not what our drivers want.  The GPU-GPU cache
 handling for Intel, for example, mapped the hardware so poorly that
 every batch just flushed everything.  Bolting on clflush-based
 CPU-GPU cache management for our platform recovered a lot of
 performance, but we're still having to reuse buffers in userland at a
 memory cost, because allocating buffers is overly expensive in the
 general supports-everybody (but oops, it's not swappable!) object
 allocator.

   
Swapping drmBOs is a couple of days' implementation and some core kernel
exports. It's just that someone needs to find the time and the right person
to talk to in the right way to get certain swapping functions exported.
 We're trying to come at it from the other direction: Implement one
 driver well.  When someone else implements another driver and finds that
 there's code that should be common, make it into a support library and
 share it.

 I actually would have liked the whole interface to userland to be
 driver-specific with a support library for the parts we think other
 people would want, but DRI2 wants to use buffer objects for its shared
 memory transport and I didn't want to rock its boat too hard, so the
 ioctls that should be supportable for everyone got moved to generic.

 If the implementation of those ioctls in generic code doesn't work for
 some drivers (say, early shmfs object creation turns out to be a bad
 idea for VRAM drivers), I'll happily push it out to the driver.


Re: TTM merging?

2008-05-14 Thread Alex Deucher
On Wed, May 14, 2008 at 2:30 PM, Eric Anholt [EMAIL PROTECTED] wrote:
 On Wed, 2008-05-14 at 02:33 +0200, Thomas Hellström wrote:
The real question is whether TTM suits the driver writers for use in 
 Linux
desktop and embedded environments, and I think so far I'm not seeing
enough positive feedback from the desktop side.
   
   I actually haven't seen much feedback at all. At least not on the
   mailing lists.
   Anyway we need to look at the alternatives, which currently means GEM.

   GEM, while still in development, basically brings us back to the
   functionality of TTM 0.1, with added paging support but without
   fine-grained locking and caching policy support.

   I might have misunderstood things, but quickly browsing the code raises
   some obvious questions:
  
   1) Some AGP chipsets don't support page addresses > 32 bits. GEM objects
   use GFP_HIGHUSER, and it's hardcoded into the Linux swap code.

  The obvious solution here is what many DMA APIs do for IOMMUs that can't
  address all of memory -- keep a pool of pages within the addressable
  range and bounce data through them.  I think the Linux kernel even has
  interfaces to support us in this.  Since it's not going to be a very
  common case, we may not care about the performance.  If we do find that
  we care about the performance, we should first attempt to get what we
  need into the linux kernel so we don't have to duplicate code, and only
  if that fails do the duplication.

  I'm pretty sure the AGP chipsets versus 32-bits pages danger has been
  overstated, though.  Besides the fact that you needed to load one of
  these older supposed machines with a full 4GB of memory (well,
  theoretically 3.5GB but how often can you even boot a system with a 2,
  1, .5gb combo?), you also need a chipset that does >32-bit addressing.

  At least all AMD and Intel chipsets don't appear to have this problem in
  the survey I did last night, as they've either got a chipset and GART that
  both address more than 32 bits, or a 32-bit chipset and a 32-bit GART.
  Basically all I'm worried about is ATI PCI[E]GART at this point.

AMD PCIE and IGP GARTs support 40 bits (Dave just committed support
this morning), so we should be fine on r3xx and newer PCIE cards.

Alex



Re: TTM merging?

2008-05-14 Thread Thomas Hellström
Keith Packard wrote:
 On Wed, 2008-05-14 at 10:21 -0700, Keith Whitwell wrote:

   
 Nobody can force you to take one path or the other, but it's certainly
 my intention when considering drivers for VRAM hardware to support
 single-copy-number textures, and for that reason, I'd be unhappy to
 see a system adopted that prevented that.
 

 And, GEM on UMA does single-copy texture updates, just as TTM does.
 From an object management perspective, GEM isn't very different from
 TTM; it's just that the current code is written for UMA, and no one has
 shown code for either of these running on non-UMA hardware.
   

That's not exactly true.
Stolen memory behaves just like VRAM from a driver writer's perspective;
it's implemented as VRAM in i915 modesetting-101 and Psb.

/Thomas







Re: TTM merging?

2008-05-14 Thread Thomas Hellström
Keith Packard wrote:
 On Wed, 2008-05-14 at 16:36 +0200, Thomas Hellström wrote:

   
 My personal feeling is that pwrites are a workaround for a workaround 
 for a very bad decision
 

 Feel free to map VRAM then if you can; I didn't need to on Intel as
 there isn't any difference.

   

By mapping device memory on UMA devices I'm referring to mapping
through the GTT aperture, either as stolen memory, pre-bound GTT pools,
or simply buffer object memory temporarily bound to the GTT.

As you've previously mentioned, this requires caching policy changes, and
it needs to be used with some care.

/Thomas







Re: TTM merging?

2008-05-14 Thread Thomas Hellström
Eric Anholt wrote:

 If the implementation of those ioctls in generic code doesn't work for
 some drivers (say, early shmfs object creation turns out to be a bad
 idea for VRAM drivers), I'll happily push it out to the driver.

   
Or perhaps use generic ioctls, but provide hooks in the driver to 
overload the core GEM functions with other implementations.
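
Something along these lines -- a minimal sketch with hypothetical
names; only the general DRM driver-vtable pattern is assumed:

struct gem_driver_ops {
        /* A NULL hook means "use the core GEM implementation". */
        int (*object_create)(struct drm_device *dev, size_t size,
                             struct my_gem_object **obj);
        int (*pwrite)(struct my_gem_object *obj, uint64_t offset,
                      uint64_t size, const void __user *data);
};

/* Generic ioctl entry point, driver-chosen backing store. */
static int gem_object_create(struct drm_device *dev, size_t size,
                             struct my_gem_object **obj)
{
        struct gem_driver_ops *ops = gem_ops_of(dev);   /* hypothetical */

        if (ops && ops->object_create)
                return ops->object_create(dev, size, obj);
        return gem_object_create_shmfs(dev, size, obj); /* hypothetical */
}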

/Thomas







[Bug 10681] No DRM and DRI support for Trident-based chips

2008-05-14 Thread bugme-daemon
http://bugzilla.kernel.org/show_bug.cgi?id=10681





--- Comment #2 from [EMAIL PROTECTED]  2008-05-14 13:37 ---
Oh, well. I'm screwed then... ;)

Two questions still haunt me by night: was there actually DRI support in the
2.4.x branch of the kernel? And if so, why did it get broken?

Anyway, I suppose the source is available from freedesktop.org and kernel.org,
so I could check it, and make sure I'm absolutely unable to understand a single
line =P

Thanks for the response.

ps: are you Alan Hourihane, the one who went through the swamp of VIA
documentation and ended up with a working driver?




[Bug 14840] radeon 60 wakeups after resume

2008-05-14 Thread bugzilla-daemon
http://bugs.freedesktop.org/show_bug.cgi?id=14840


Jose Rodriguez [EMAIL PROTECTED] changed:

   What       |Removed |Added
   ----------------------------------
   Status     |NEW     |RESOLVED
   Resolution |        |FIXED




--- Comment #8 from Jose Rodriguez [EMAIL PROTECTED]  2008-05-14 15:12:57 PST 
---
Rebuilt everything today and it works fine. The issue with OA may have to
do with some mess I had from mixing two versions of it.

Thanks

Jose




Re: TTM merging?

2008-05-14 Thread Keith Packard
On Wed, 2008-05-14 at 21:41 +0200, Thomas Hellström wrote:

 As you've previously mentioned, this requires caching policy changes and 
 it needs to be used with some care.

I didn't need that in my drivers as GEM handles the WB -> GPU object
transfer already.

Object mapping is really the least important part of the system; it
should only be necessary when your GPU is deficient, or your API so
broken as to require this inefficient mechanism. I suspect we'll be
tracking 965 performance as we work to eliminate mapping; we should see
a steady increase until we're no longer mapping anything that the GPU
uses into the application's address space.

-- 
[EMAIL PROTECTED]




Re: TTM merging?

2008-05-14 Thread Eric Anholt
On Wed, 2008-05-14 at 21:51 +0200, Thomas Hellström wrote:
 Eric Anholt wrote:
 
  If the implementation of those ioctls in generic code doesn't work for
  some drivers (say, early shmfs object creation turns out to be a bad
  idea for VRAM drivers), I'll happily push it out to the driver.
 

 Or perhaps use generic ioctls, but provide hooks in the driver to 
 overload the core GEM functions with other implementations.

Yeah, that's what I was thinking.

-- 
Eric Anholt [EMAIL PROTECTED]


Re: TTM merging?

2008-05-14 Thread Allen Akin
On Wed, May 14, 2008 at 03:48:47PM -0700, Keith Packard wrote:
| Object mapping is really the least important part of the system; it
| should only be necessary when your GPU is deficient, or your API so
| broken as to require this inefficient mechanism.

In the OpenGL case, object mapping wasn't originally a part of the API.
It was added because people building hardware and apps for Intel-based
PCs determined that it was worthwhile, and demanded it.

This wasn't on my watch, so I can't give you the history in detail, but
my recollection is that the primary uses were texture loading for games
and video apps, and incremental changes to vertex arrays for games and
rendering apps.

So maybe the hardware has changed sufficiently that the old reasoning
and performance measurements are no longer valid.  It would still be
good to know for sure that eliminating low-level support for the
mechanism won't be drastically bad for the classes of apps that use it.

Allen



Re: TTM merging?

2008-05-14 Thread Keith Packard
On Wed, 2008-05-14 at 16:34 -0700, Allen Akin wrote:
 On Wed, May 14, 2008 at 03:48:47PM -0700, Keith Packard wrote:
 | Object mapping is really the least important part of the system; it
 | should only be necessary when your GPU is deficient, or your API so
 | broken as to require this inefficient mechanism.
 
 In the OpenGL case, object mapping wasn't originally a part of the API.
 It was added because people building hardware and apps for Intel-based
 PCs determined that it was worthwhile, and demanded it.

In a UMA environment, it seems so obvious to map objects into the
application and just bypass the whole kernel API issue. That, however,
ignores caching effects, which appear to dominate performance effects
these days.

 This wasn't on my watch, so I can't give you the history in detail, but
 my recollection is that the primary uses were texture loading for games
 and video apps, and incremental changes to vertex arrays for games and
 rendering apps.

Most of which can be efficiently performed with a pwrite-like system
where the application explicitly tells the system which portions of the
object to modify. Again, it seems insane when everything is a uniform
mass of pages, except for the subtle differences in cache behaviour.

 So maybe the hardware has changed sufficiently that the old reasoning
 and performance measurements are no longer valid.  It would still be
 good to know for sure that eliminating low-level support for the
 mechanism won't be drastically bad for the classes of apps that use it.

I'm not sure we can (or want to) eliminate it entirely, all that I
discovered was that it should be avoided as it has negative performance
consequences. Not dire, but certainly not positive either.

I don't know how old these measurements were, but certainly the gap
between CPU and memory speed has been rapidly increasing for years,
along with cache sizes, both of which have a fairly dramatic effect on
how best to access actual memory.

-- 
[EMAIL PROTECTED]




Re: TTM merging?

2008-05-14 Thread Allen Akin
On Wed, May 14, 2008 at 05:22:06PM -0700, Keith Packard wrote:
| On Wed, 2008-05-14 at 16:34 -0700, Allen Akin wrote:
|  In the OpenGL case, object mapping wasn't originally a part of the API.
|  It was added because people building hardware and apps for Intel-based
|  PCs determined that it was worthwhile, and demanded it.
| 
| In a UMA environment, it seems so obvious to map objects into the
| application and just bypass the whole kernel API issue. That, however,
| ignores caching effects, which appear to dominate performance effects
| these days.

I think the confusion arises because the mechanism is used for several
purposes, some of which are likely to be dominated by cache effects on
some implementations, and others that aren't.  I'm thinking about the
differences between piecemeal updating of the elements of a vertex
array, versus grabbing an image from a video capture card or a direct
read() from a file into a texture buffer.  The API is intended to allow
apps and drivers to make intelligent choices between cases like those.
Check out BufferData() and MapBuffer() in section 2.9 of the OpenGL 2.1
spec for a discussion which specifically mentions cache effects.
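
The two styles side by side in the standard buffer-object API -- a
sketch, assuming GL 1.5 prototypes are available via
GL_GLEXT_PROTOTYPES; error handling omitted:

#define GL_GLEXT_PROTOTYPES 1
#include <GL/gl.h>
#include <GL/glext.h>
#include <string.h>

/* Explicit-copy style: tell GL exactly which bytes changed and let
 * the driver pick the transfer path (pwrite-like). */
void update_subrange(GLuint vbo, GLintptr off, GLsizeiptr len,
                     const void *data)
{
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glBufferSubData(GL_ARRAY_BUFFER, off, len, data);
}

/* Mapping style: write into the buffer storage in place (map-like). */
void update_mapped(GLuint vbo, GLsizeiptr len, const void *data)
{
        void *ptr;

        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        ptr = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
        if (ptr) {
                memcpy(ptr, data, len);
                glUnmapBuffer(GL_ARRAY_BUFFER);
        }
}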

|  This wasn't on my watch, so I can't give you the history in detail, but
|  my recollection is that the primary uses were texture loading for games
|  and video apps, and incremental changes to vertex arrays for games and
|  rendering apps.
| 
| Most of which can be efficiently performed with a pwrite-like system
| where the application explicitly tells the system which portions of the
| object to modify. ...

Interfaces of that style are present in OpenGL, and predate the mapping
interfaces.  I know they were regarded as too slow for some apps, so the
mapping interfaces were added.  The early extensions were driven by
vendors who didn't support UMA, so that couldn't have been the only
model they were concerned about.  Beyond that I'm not sure.

|  So maybe the hardware has changed sufficiently that the old reasoning
|  and performance measurements are no longer valid.  It would still be
|  good to know for sure that eliminating low-level support for the
|  mechanism won't be drastically bad for the classes of apps that use it.
| 
| I'm not sure we can (or want to) eliminate it entirely, all that I
| discovered was that it should be avoided as it has negative performance
| consequences. Not dire, but certainly not positive either.
| 
| I don't know how old these measurements were, but certainly the gap
| between CPU and memory speed has been rapidly increasing for years,
| along with cache sizes, both of which have a fairly dramatic effect on
| how best to access actual memory.

The first reference I can find to an object-mapping API in OpenGL is
from 2001.  I'm sure the vendors had implementations internally before
then, but that's when things were mature enough to start standardizing.
Since the functionality is present in OpenGL 2.0 (vintage 2006?),
apparently someone thought it was still useful enough to carry over from
OpenGL 1.X.

Again, sorry I don't know the entire history on this one.

Allen



Re: TTM merging?

2008-05-14 Thread Thomas Hellström
Keith Packard wrote:
 On Wed, 2008-05-14 at 21:41 +0200, Thomas Hellström wrote:

   
 As you've previously mentioned, this requires caching policy changes and 
 it needs to be used with some care.
 

 I didn't need that in my drivers as GEM handles the WB -> GPU object
 transfer already.

 Object mapping is really the least important part of the system; it
 should only be necessary when your GPU is deficient, or your API so
 broken as to require this inefficient mechanism. I suspect we'll be
 tracking 965 performance as we work to eliminate mapping, we should see
 a steady increase until we're no longer mapping anything that the GPU
 uses into the application's address space.

   
Static write-combined maps of fairly static objects like scanout buffers
or buffer pools are not inefficient. They provide by far the highest
throughput for writing (even beating cache-coherent access). But they may
take some time to set up or tear down, which means you should avoid doing
that as much as possible.

For things like scanout buffers or video buffers you should really use
such mappings, otherwise you lose big.
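
A sketch of what I mean, assuming the ioremap_wc() interface that
recent kernels are gaining (base address and size are whatever the
device hands you):

#include <linux/io.h>

/* Map the scanout buffer write-combined once and keep the mapping:
 * WC gives the streaming-write throughput, and the expensive part is
 * the setup/teardown, not the use. */
static void __iomem *map_scanout_wc(resource_size_t vram_base,
                                    unsigned long size)
{
        return ioremap_wc(vram_base, size);
}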

The current implementation of GEM, which doesn't allow overloading of the
core GEM functions, blocked the possibility of setting up such mappings.
This is about to change, and I'm happy with that.

/Thomas







