Re: Doing better than CS ioctl ?
On Wed, 2009-08-12 at 15:27 +0100, Keith Whitwell wrote: [...]

My idea was to group states according to what they affect, like the GL state of a texture, or objects in nvidia hw. The idea is that most objects can only be validated if we know all of their state. For a renderbuffer we need to know its format, size and tiling; for a texture we need to know its format, size, mipmap levels and possibly other state, and so on and so forth. If we just take arbitrary packets from userspace we might end up in situations that are hard to decipher. If one validated CS programs a renderbuffer and other state like the zbuffer, it might be valid on its own, but combined with another validated CS things might be completely wrong: that other CS might just change the clipping and renderbuffer size but not update the zbuffer, so we might end up rendering to a zbuffer that is either too small or too big (too small is what we don't want to do ;)). So in the end you need to enforce a set of registers onto userspace.
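The "enforce a set of registers" check could look something like the sketch below. The register offsets and the contents of the renderbuffer set are made up for illustration; this is not the real radeon register map, just the shape of the validation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register offsets -- illustrative only, not the real
 * r3xx/r5xx register map. */
#define REG_SCISSOR      0x1000
#define REG_CLIP         0x1004
#define REG_COLORBUFFER  0x1008
#define REG_ZBUFFER      0x100c

/* A "state set" is only valid if the command stream programs every
 * register in the set, so the kernel never accepts a partial
 * combination (e.g. a new renderbuffer size without a matching
 * zbuffer). */
static const uint32_t renderbuffer_set[] = {
    REG_SCISSOR, REG_CLIP, REG_COLORBUFFER, REG_ZBUFFER,
};

/* Return true when every register of the set appears among the
 * registers written by the submitted stream. */
bool set_fully_programmed(const uint32_t *written, size_t nwritten,
                          const uint32_t *set, size_t nset)
{
    for (size_t i = 0; i < nset; i++) {
        bool found = false;
        for (size_t j = 0; j < nwritten; j++) {
            if (written[j] == set[i]) {
                found = true;
                break;
            }
        }
        if (!found)
            return false;
    }
    return true;
}
```

A CS that only touches clipping and scissor would be rejected by this check until it also programs the colorbuffer and zbuffer registers of the set.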
Userspace needs to submit a CS which programs at least this set to be validated. We can have different sets, like a renderbuffer set (clipping, scissor, colorbuffer, zbuffer registers), a vertex set (vbo, ...), a shader set (shader registers), and then you can combine different sets to do the rendering. I think splitting states matters because you often render to the same buffer but with a different vbo, pixel shader, vertex shader or primitive, so it sounds better to split states. There I think we end up pretty much at what I proposed.

Thing is, I don't think a packet format is the best way to communicate with the kernel, as the kernel will have to parse the buffer, which is resource consuming, not to mention that tracking states that way is a bit painful. I think state objects with a structure defined per asic (r3xx, r5xx, r6xx) are better: no parsing, a clear split of each value, easy access to check that all together they do something allowed, and then it's easy and quick for the kernel to build the packets out of this. On the backward compatibility side it's not harder to expand those states:

    struct radeon_state {
            u32 state_id;
            u64 state_struct_ptr;
    };

    version 1: state_id = 0x501
    struct rv515_texture {
            u32 width;
            u32 height;
            ...
    };

    version 2: state_id = 0x502
    struct rv515_texture {
            u32 width;
            u32 height;
            ...
            u32 texture_pixel_sampling_center; /* well, anything new */
    };

So from the user's point of view it could still use 0x501, and the kernel will just ignore the end of the structure and set safe default values for those fields. If userspace submits a 0x502 then it's assumed that it knows about the new state, and the kernel will take it into account. I don't think this adds more work or code than adding a new packet to a parser. Anyway, the biggest problem of any such approach is that we need to figure out how to allocate memory to store either the validated CS or the kernel-built packets on behalf of the program; we don't want to abuse kernel memory allocation.
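A sketch of how the kernel side of this versioning could work, assuming the hypothetical rv515 texture state and the 0x501/0x502 ids above (the field names, sizes and default value are illustrative):

```c
#include <stdint.h>
#include <string.h>

/* Illustrative version-2 layout of the state struct from the text;
 * v1 is simply a prefix of it. */
struct rv515_texture_v2 {
    uint32_t width;
    uint32_t height;
    uint32_t texture_pixel_sampling_center; /* new in 0x502 */
};

#define RV515_TEXTURE_V1_SIZE   (2 * sizeof(uint32_t))
#define DEFAULT_SAMPLING_CENTER 0  /* hypothetical safe default */

/* The kernel copies only the bytes the submitted version defines and
 * fills the rest of its own, newer structure with safe defaults, so an
 * old userspace submitting state_id 0x501 keeps working unchanged. */
void decode_texture_state(uint32_t state_id, const void *user,
                          struct rv515_texture_v2 *out)
{
    memset(out, 0, sizeof(*out));
    out->texture_pixel_sampling_center = DEFAULT_SAMPLING_CENTER;
    if (state_id == 0x501)
        memcpy(out, user, RV515_TEXTURE_V1_SIZE);
    else
        memcpy(out, user, sizeof(*out));
}
```

The point of the sketch is that no parser change is needed when a field is appended: only the size copied and the defaults for the tail differ between versions.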
And we can't allow userspace to modify those objects after they have been validated :)

Cheers,
Jerome

--
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel
Re: Doing better than CS ioctl ?
Dave,

The big problem with the (second) radeon approach of state objects was that we defined those objects statically and encoded them into the kernel interface. That meant that when new hardware functionality was needed (or discovered) we had to rev the kernel interface, usually in a fairly ugly way.

I think Jerome's approach could be a good improvement if the state objects it creates are defined by software at runtime, more like little display lists than pre-defined state atoms. The danger again is that you run into cases where you need to expand the objects the verifier will allow userspace to create, but at least in doing so you won't be breaking existing users of the interface.

I think the key is that there should be no pre-defined format for these state objects, simply that they should be a sequence of legal commands/register writes that the kernel validates once and userspace can execute multiple times.

Keith

On Sat, 2009-08-08 at 05:43 -0700, Dave Airlie wrote: [...]
Re: Doing better than CS ioctl ?
On Sat, Aug 8, 2009 at 7:51 AM, Jerome Glisse gli...@freedesktop.org wrote: [...]

I think this sounds quite like the original radeon interface, or maybe even a bit like the second one. The original one stored the registers in the sarea, updated the context under the lock, and had the kernel emit it. The second one had a bunch of state objects, containing ranges of registers that were safe to emit. Maybe Keith Whitwell can point out why these were a good/bad idea, not sure if anyone else remembers that far back.

Dave.
Doing better than CS ioctl ?
Investigating where time is spent in the radeon/kms world when doing rendering led me to question the design of the CS ioctl. As I am among the people behind it, I think I should give some historical background on the choices that were made.

The first motivation behind the CS ioctl was to have a common language between userspace and kernel, and between kernel and device. Of course, in an ideal world, commands submitted through the CS ioctl could be forwarded directly to the GPU without much overhead. Thing is, the world we live in isn't that good. There are 2 things the CS ioctl does before forwarding commands:

1- First it must rewrite any packet which supplies an offset to the GPU with the address at which the memory manager validated the buffer object associated with this packet. We can't get rid of this with the CS ioctl (we might do something very clever like a new microcode for the CP so that the CP can rewrite packets using some table of validated buffer offsets, but I am not even sure the CP would be powerful enough to do that).

2- In order to provide more advanced security than what we had in the past, I added a CS checker facility which is responsible for analyzing the command stream and making sure that the GPU won't read or write outside the supplied buffer object list. DRI1 didn't offer such advanced checking. This feature was added with GPU sharing in mind, where sensitive applications might run on the GPU and we might like to protect their memory.

We can obviously skip the second item and things would work, but userspace would be able to abuse the GPU to access memory outside its own GPU objects (this doesn't mean it would be able to access any system ram, but rather any ram that is mapped to the GPU, which should for the time being only be pixmaps, textures, vbos or things like that). Bottom line is that with the CS ioctl we do the same work twice: in userspace we build a command stream understandable by the GPU, and in kernel space we decode this command stream to check it. Obviously this sounds wrong.
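The first item, relocation rewriting, can be sketched roughly like this. The relocation table and the mapping from bo handle to validated address are invented for illustration; this is not the actual radeon CS format, just the shape of the patching step.

```c
#include <stddef.h>
#include <stdint.h>

/* One relocation: which dword of the command stream holds a GPU
 * offset, and which buffer object that offset refers to.
 * (Hypothetical layout, for illustration only.) */
struct reloc {
    size_t   dw_index;   /* dword in the stream holding the offset */
    uint32_t bo_handle;  /* buffer object it refers to */
};

/* validated_offset[] maps a bo handle to its post-validation GPU
 * address; in a real driver this would come from the memory manager
 * after it has placed the buffer. */
void rewrite_relocs(uint32_t *cs, const struct reloc *relocs,
                    size_t nrelocs, const uint32_t *validated_offset)
{
    for (size_t i = 0; i < nrelocs; i++)
        cs[relocs[i].dw_index] = validated_offset[relocs[i].bo_handle];
}
```

This is the part that cannot be skipped: userspace does not know where a buffer will be validated, so someone (kernel, or speculatively the CP) has to patch the offsets in.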
That being said, the CS ioctl isn't that bad: it doesn't consume much in the benchmarks I have done, but I expect it might consume more on older CPUs or when many complex 3D apps run at the same time. So I am not proposing to throw it away, but rather to discuss a better interface we could add at a later point to slowly replace CS. CS brings today features we needed yesterday, so we should focus our effort on getting the CS ioctl as smooth and good as possible.

So as a pet project I have been thinking these last few days about what a better interface between userspace and kernel would be, and I came up with something in between gallium state objects and nvidia GPU objects (well, at least as far as I know each of those, my design sounds close to that). The idea behind the design is that whenever userspace allocates a bo, userspace knows the properties of the bo. If it's a texture, userspace knows the size, the number of mipmap levels, the border, ... of the texture. If it's a vbo, it knows the layout, the size, the number of elements, ... Same for a rendering viewport: it knows the size and associated properties.

Design: 2 ioctls:

create_object:
    supply:
        - object type id specific to asic
        - object structure associated to type id, fully describing the object
    return:
        - object id
    processing:
        - check that the states provided are correct and that the bo is big enough for them
        - translate the states into a packet stream
        - store the object and packet stream associated with the object id

batch:
    supply:
        - table of batches
    processing:
        - check each batch and schedule them

Each batch is a set of object ids, and userspace needs to provide all the object ids for the batch to be valid. For instance, if a shader object needs 5 textures, the batch needs to have 5 texture object ids supplied. Checking that a batch is valid is quick as it's a set of already-checked objects.
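The two ioctls above could be sketched as drm-style argument structs. All names and layouts here are hypothetical; they only mirror the supply/return fields listed in the design.

```c
#include <stdint.h>

/* Hypothetical argument struct for the create_object ioctl. */
struct drm_radeon_create_object {
    uint32_t type_id;     /* object type id, specific to the asic */
    uint64_t state_ptr;   /* userspace pointer to the per-asic state
                           * struct fully describing the object */
    uint32_t object_id;   /* returned: id of the validated object */
};

/* Hypothetical argument struct for the batch ioctl: a batch is just a
 * list of already-validated object ids (e.g. the 5 texture objects a
 * shader references, plus the shader and renderbuffer objects). */
struct drm_radeon_batch {
    uint64_t object_ids;  /* userspace pointer to an array of ids */
    uint32_t num_objects; /* number of ids in the array */
};
```

The kernel's per-batch work then reduces to looking up each id and checking the combination, rather than re-parsing a raw packet stream.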
You create the object just after creating the bo (if it's a pixmap you can create a texture and a viewport just after, and whenever you want to use this pixmap you just use the proper object id). This means that for objects which are used multiple times you do the object property checking once and then take advantage of quick reuse. An example of what the objects look like is at: http://people.freedesktop.org/~glisse/rv515obj.h

So what we win is fast checking and better knowledge in the kernel of the use of a bo, all of which allows many optimizations:
- simple state re-emission optimization (don't re-emit the state of an object if the object's state is already set on the GPU)
- clever flushing: if a bo is only associated with a texture object then the kernel knows that