First version of host1x intro

2012-12-07 Thread Mark Zhang
On 12/07/2012 02:44 PM, Terje Bergström wrote:
> On 07.12.2012 07:38, Mark Zhang wrote:
>> On 12/06/2012 07:36 PM, Terje Bergström wrote:
>>> This is about the hardware, and the correct verb is "copy". HOST1X
>>> hardware pre-fetches opcodes from push buffer and contents of GATHERs to
>>> a FIFO to overcome memory latencies. The execution happens from FIFO.
>> Okay, so the command FIFO is not a part of memory we need to allocate.
>> It's inside in every host1x clients.
> 
> Almost - it's part of HOST1X itself, not clients.

Got it.

> 
>> Yeah, I have known the idea. What I'm confused before is that why we
>> need this reloc table because all mem are allocated from us(I mean,
>> tegradrm/nvhost) so ideally we can find out all buffer related infos in
>> the driver while userspace doesn't need to provide such informations.
> 
> For lots of buffers, there's no need to map them to user space at all,
> so user space can treat the buffer as just an abstract region. For
> example, getting handle to frame buffer and passing it to 2D for
> rendering doesn't require user space to map the memory.
> 
> Second point is that as long as we haven't pinned memory, there's a
> possibility to move the pages around to do defragmenting etc. As soon as
> you get a device virtual address to a page, you need to pin it and can't
> move it anymore. So we want to build the APIs and sequences so that the
> pages can be pinned as late as possible.
> 

Great, those are two more strong arguments. Big thanks. :)

> Naturally, in CMA, the second point isn't that relevant, but I want the
> API to tolerate IOMMU support, too.
> 
> Terje
> 


First version of host1x intro

2012-12-07 Thread Mark Zhang
On 12/07/2012 02:46 AM, Stephen Warren wrote:
> On 12/06/2012 01:13 AM, Mark Zhang wrote:
[...]
>>
>> Yes, I think this is what I mean. No dummy information in the command
>> stream, userspace just fills the address which it uses(actually this is
>> cpu address of the buffer) in the command stream, and our driver must
>> have a HashTable or something which contains the buffer address pair --
>> (cpu address, dma address), so our driver can find the dma addresses for
>> every buffer then modify the addresses in the command stream.
> 
> Typically there would be no CPU address; there's no need in most cases
> to ever map most buffers to the CPU.
> 
> Automatically parsing the buffer sounds like an interesting idea, but
> then the kernel relocation code would have to know the meaning of every
> single register or command-stream "method" in order to know which of
> them take a buffer address as an argument. I am not familiar with this
> HW specifically, so perhaps it's much more regular than I think and it's
> actually easy to do that, but I imagine it'd be a PITA to implement that
> (although perhaps we have to for the command-stream validation stuff
> anyway?).

Yes. Making the driver understand everything in the command stream is neither
easy nor pleasant work, and that code would have to be maintained for every
generation of Tegra.

> Also, it'd be a lot quicker at least for larger command
> buffers to go straight to the few locations in the command stream where
> a buffer is referenced (i.e. use the side-band metadata for relocation)
> rather than having the CPU re-read the entire command stream in the
> kernel to parse it.
> 

Agreed, although we could categorize the commands and ignore those that don't
take a buffer address as an argument.
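
Just to make the side-band idea concrete for myself, here is a rough sketch of
how userspace might record a reloc instead of writing a real address into the
stream. All struct and helper names below are made up, not the actual tegradrm
interface.

```c
#include <stdint.h>

/* Hypothetical side-band relocation entry -- not the real tegradrm UAPI. */
struct reloc_entry {
	uint32_t cmdbuf_offset;   /* word offset of the placeholder in the stream */
	uint32_t target_handle;   /* handle of the referenced buffer */
	uint32_t target_offset;   /* byte offset inside that buffer */
};

/* Userspace emits a dummy word and records where it put it. */
static void emit_buffer_ref(uint32_t *stream, uint32_t *pos,
			    struct reloc_entry *relocs, uint32_t *nrelocs,
			    uint32_t handle, uint32_t offset)
{
	relocs[*nrelocs].cmdbuf_offset = *pos;
	relocs[*nrelocs].target_handle = handle;
	relocs[*nrelocs].target_offset = offset;
	(*nrelocs)++;

	stream[(*pos)++] = 0xdeadbeef;	/* placeholder, patched by the kernel */
}
```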



First version of host1x intro

2012-12-07 Thread Mark Zhang
On 12/06/2012 07:36 PM, Terje Bergström wrote:
> On 06.12.2012 09:06, Mark Zhang wrote:
>> Thank you for the doc. So here I have questions:
>>
>> Push buffer contains a lot of opcodes for this channel. So when multiple
>> userspace processes submit jobs to this channel, all these jobs will be
>> saved in the push buffer and return, right? I mean, nvhost driver will
>> create a workqueue or something to pull stuffs out from the push buffer
>> and process them one by one?
> 
> Yes, "sync queue" contains the list of jobs that are pending (or that
> kernel thinks are pending).
>

I see.

> Push buffer in general case contains GATHER opcodes, which point to the
> streams from user space. This way we don't have to copy command streams.
> In case IOMMU is not available, we either have to copy the contents
> directly to push buffer so user space can't modify it later (I'm trying
> to implement this), or then we have to ensure the command streams cannot
> be tampered with in some other way.
> 

Yes, we have been talking about this for several days.

>> Besides, "If command DMA sees opcode GATHER, it will allocate(you missed
>> a verb here and I suppose it may be "allocate") a memory area to command
>> FIFO" -- So why we need command FIFO, this extra component? Can't we
>> just pass the correct address in push buffer to host1x clients?
> 
> This is about the hardware, and the correct verb is "copy". HOST1X
> hardware pre-fetches opcodes from push buffer and contents of GATHERs to
> a FIFO to overcome memory latencies. The execution happens from FIFO.
> 
> host1x clients don't know about push buffers. They're a feature of
> HOST1X. HOST1X just interprets the opcodes and performs the operations
> indicated, for example writes a value to a register of 2D.
> 
> In general, the FIFO is invisible to users of HOST1X, but important to
> understand when debugging stuck hardware.
> 

Okay, so the command FIFO is not part of the memory we need to allocate.
It lives inside every host1x client.

>> And when the host1x client starts working is controlled by userspace
>> program, right? Because command DMA allocates the command FIFO when it
>> sees opcode "GATHER". Or nvhost driver will generate "GATHER" as well,
>> to buffer some opcodes then send them to host1x clients in one shot?
> 
> FIFO is hardware inside HOST1X, so it's not allocated by user space or
> kernel.
> 

Got it. So the commands just get copied into the FIFO by the hardware.

>> Could you explain more about this "relocation information"? I assume the
>> "target buffers" here mentioned are some memory saving, e.g, textures,
>> compressed video data which need to be decoded...
>> But the userspace should already allocate the memory to save them, why
>> we need to relocate?
> 
> Lucas already did a better job of explaining than I could've, so I'll
> pass. :-)
> 

Yeah, I know the idea now. What confused me before was why we need this reloc
table at all: since all the memory is allocated by us (I mean, tegradrm/nvhost),
ideally the driver could find all the buffer-related information itself, without
userspace having to provide it.

> I'll add some notes about these to the doc.
> 

Thanks for the doc, it's really helpful.

> Terje
> 


First version of host1x intro

2012-12-07 Thread Mark Zhang
On 12/06/2012 07:17 PM, Lucas Stach wrote:
> Am Donnerstag, den 06.12.2012, 16:13 +0800 schrieb Mark Zhang:
>> On 12/06/2012 04:00 PM, Lucas Stach wrote:
> [...]
>>>
>>> Or maybe I'm misunderstanding you and you mean it the other way around.
>>> You don't let userspace dictate the addresses, the relocation
>>> information just tells the kernel to find the addresses of the
>>> referenced buffers for you and insert them, instead of the dummy
>>> information, into the command stream.
>>
>> Yes, I think this is what I mean. No dummy information in the command
>> stream, userspace just fills the address which it uses(actually this is
>> cpu address of the buffer) in the command stream, and our driver must
>> have a HashTable or something which contains the buffer address pair --
>> (cpu address, dma address), so our driver can find the dma addresses for
>> every buffer then modify the addresses in the command stream.
>>
>> Hope I explain that clear.
>>
> 
> And to do so we would have to hold an unfortunately large table in
> kernel, as a buffer can be mapped by different userspace processes at
> different locations. Also you would have to match against some variably
> sized ranges to find the correct buffer, as the userspace would have to
> pack bufferbaseaddress and offset into same value.
> 
> I really don't think it's worth the hassle and it's the right way to use
> the proven scheme of a sidebandbuffer to pass in the reloc informations.
> 

Yep, agree. Thanks for the explanation.

Mark
> Regards,
> Lucas
> 


First version of host1x intro

2012-12-07 Thread Terje Bergström
On 07.12.2012 07:38, Mark Zhang wrote:
> On 12/06/2012 07:36 PM, Terje Bergström wrote:
>> This is about the hardware, and the correct verb is "copy". HOST1X
>> hardware pre-fetches opcodes from push buffer and contents of GATHERs to
>> a FIFO to overcome memory latencies. The execution happens from FIFO.
> Okay, so the command FIFO is not a part of memory we need to allocate.
> It's inside in every host1x clients.

Almost - it's part of HOST1X itself, not clients.

> Yeah, I have known the idea. What I'm confused before is that why we
> need this reloc table because all mem are allocated from us(I mean,
> tegradrm/nvhost) so ideally we can find out all buffer related infos in
> the driver while userspace doesn't need to provide such informations.

For lots of buffers, there's no need to map them to user space at all,
so user space can treat the buffer as just an abstract region. For
example, getting handle to frame buffer and passing it to 2D for
rendering doesn't require user space to map the memory.

Second point is that as long as we haven't pinned memory, there's a
possibility to move the pages around to do defragmenting etc. As soon as
you get a device virtual address to a page, you need to pin it and can't
move it anymore. So we want to build the APIs and sequences so that the
pages can be pinned as late as possible.
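
As a rough illustration of what I mean (the types and helpers here are
placeholders, not the actual nvhost interfaces): pinning happens only on the
submit path, and the pins are dropped once the job's sync point is reached.

```c
/* Sketch only: all types and helpers here are hypothetical placeholders,
 * not the real nvhost/tegradrm API. */
struct job {
	struct buffer **buffers;
	int num_buffers;
};

int pin_buffer(struct buffer *buf);     /* fix pages, map to device (IOVA) */
void unpin_buffer(struct buffer *buf);  /* allow pages to move again */
void patch_relocs(struct job *job);     /* write device addresses into streams */
void push_to_channel(struct job *job);  /* hand the job to HOST1X */

/* Pin as late as possible: only on the actual submit path. */
int submit_job(struct job *job)
{
	int i, err;

	for (i = 0; i < job->num_buffers; i++) {
		err = pin_buffer(job->buffers[i]);
		if (err)
			goto unpin;
	}

	patch_relocs(job);
	push_to_channel(job);
	return 0;

unpin:
	while (--i >= 0)
		unpin_buffer(job->buffers[i]);
	return err;
}

/* Once the job's sync point is reached, drop the pins so pages can move again. */
void job_complete(struct job *job)
{
	int i;

	for (i = 0; i < job->num_buffers; i++)
		unpin_buffer(job->buffers[i]);
}
```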

Naturally, in CMA, the second point isn't that relevant, but I want the
API to tolerate IOMMU support, too.

Terje


First version of host1x intro

2012-12-06 Thread Mark Zhang
On 12/06/2012 04:00 PM, Lucas Stach wrote:
> Am Donnerstag, den 06.12.2012, 15:49 +0800 schrieb Mark Zhang:
[...]
>>
>> OK. So these relocation addresses are used to let userspace tells kernel
>> which buffers mentioned in the command should be relocated to addresses
>> which host1x clients able to reach.
>>
> Yep, preferably all buffers referenced by a command stream should
> already be set up in such a position (CMA with Tegra2) or the relocation
> should be nothing more than setting up IOMMU page tables (Tegra3).
> 
>> I'm also wondering that, if our driver understands the stuffs in the
>> commands, maybe we can find out all addresses in the command, in that
>> way, we will not need userspace tells us which are the addresses need to
>> be relocated, right?
> 
> No. How will the kernel ever know which buffer gets referenced in a
> command stream? All the kernel sees is is a command stream with
> something like "blit data to address 0xADDR" in it. The only info that
> you can gather from that is that there must be some buffer to blit into.
> Neither do you know which buffer the stuff should be going to, nor can
> you know if you blit to offset zero in this buffer. It's perfectly valid
> to only use a subregion of a buffer.
> 
> Or maybe I'm misunderstanding you and you mean it the other way around.
> You don't let userspace dictate the addresses, the relocation
> information just tells the kernel to find the addresses of the
> referenced buffers for you and insert them, instead of the dummy
> information, into the command stream.

Yes, I think this is what I mean. No dummy information in the command stream:
userspace just fills in the address it uses (actually the CPU address of the
buffer), and our driver keeps a hash table or something containing the buffer
address pairs (CPU address, DMA address), so the driver can look up the DMA
address for every buffer and then patch the addresses in the command stream.

Hope I explained that clearly.

> 
> Regards,
> Lucas
> 
> 


First version of host1x intro

2012-12-06 Thread Mark Zhang
On 12/06/2012 03:20 PM, Lucas Stach wrote:
> Am Donnerstag, den 06.12.2012, 15:06 +0800 schrieb Mark Zhang:
> [...]
>>> First action taken is taking a reference to all buffers in the command
>>> stream. This includes the command stream buffers themselves, but also
>>> the target buffers. We also map each buffer to target hardware to get a
>>> device virtual address.
>>>
>>> After this, relocation information is processed. Each reference to
>>> target buffers in command stream are replaced with device virtual
>>> addresses. The relocation information contains the reference to target
>>> buffer, and to command stream to be able to do this.
>>
>> Could you explain more about this "relocation information"? I assume the
>> "target buffers" here mentioned are some memory saving, e.g, textures,
>> compressed video data which need to be decoded...
>> But the userspace should already allocate the memory to save them, why
>> we need to relocate?
>>
> "Relocation" is the term used to express the fixup of addresses in the
> command buffer. You are right, the memory is allocated and stays the
> same, but userspace can not know where in the GPU address space a
> specific buffer is bound (maybe it's even unbound at the time, when
> userspace stitches together the pushbuf). So userspace dumps some kind
> of dummy information into the command stream instead of a real buffer
> address. With the relocation information (which is kind of a sideband
> buffer to the commandbuf) it then tells the kernel to insert real GPU
> virtual addresses in the locations of the dummy info. For the kernel to
> do so, it needs to know:
> 1. where in the command stream is a dummy address
> 2. which buffers address should be inserted instead
> 3. which offset into this buffer should be added to the address to be
> inserted
> 

OK. So this relocation information is used to let userspace tell the kernel
which buffers mentioned in the command stream should be relocated to addresses
that the host1x clients are able to reach.

I'm also wondering: if our driver understood the contents of the commands,
maybe we could find all the addresses in the command stream ourselves, so we
wouldn't need userspace to tell us which addresses have to be relocated, right?

Mark
> So while processing a reloc, kernel pins buffers in memory (makes pages
> non-movable and bind them into gpu address space) and substitute all
> dummy information with real gpu virt addresses in the commandbuf.
> 
> Regards,
> Lucas
> 


First version of host1x intro

2012-12-06 Thread Mark Zhang
On 12/05/2012 05:47 PM, Terje Bergström wrote:
> Hi,
> 
[...]
> 
> Channels
> 
> 
> Channel is a push buffer containing HOST1X opcodes. The push buffer
> boundaries are defined with `HOST1X_CHANNEL_DMASTART_0` and
> `HOST1X_CHANNEL_DMAEND_0`. `HOST1X_CHANNEL_DMAGET_0` indicates the next
> position within the boundaries that is going to be processes, and
> `HOST1X_CHANNEL_DMAPUT_0` indicates the position of last valid opcode.
> Whenever `HOST1X_CHANNEL_DMAPUT_0` and `HOST1X_CHANNEL_DMAGET_0` differ,
> command DMA will copy commands from push buffer to a command FIFO.
> 
> If command DMA sees opcode GATHER, it will a memory area to command
> FIFO. The number of words is indicated in GATHER opcode, and the base
> address is read from the following word. GATHERs are not recursive.
> 

Thank you for the doc. I have some questions:

The push buffer contains a lot of opcodes for this channel. So when multiple
userspace processes submit jobs to this channel, all these jobs are queued in
the push buffer and the submit returns, right? I mean, the nvhost driver will
create a workqueue or something to pull work out of the push buffer and
process it one job at a time?

Besides, "If command DMA sees opcode GATHER, it will allocate (you missed a
verb here and I suppose it may be "allocate") a memory area to command
FIFO" -- so why do we need the command FIFO, this extra component? Can't we
just pass the correct addresses in the push buffer to the host1x clients?

And the point at which a host1x client starts working is controlled by the
userspace program, right? Because command DMA allocates the command FIFO when
it sees the opcode "GATHER". Or will the nvhost driver generate "GATHER" as
well, to buffer some opcodes and then send them to the host1x clients in one
shot?

> HOST1X command processor goes through the FIFO and executes opcodes.
> Each channel has some stored state, such as the client unit this channel
> is talking to. The most important opcodes are:
>
[...]
> First action taken is taking a reference to all buffers in the command
> stream. This includes the command stream buffers themselves, but also
> the target buffers. We also map each buffer to target hardware to get a
> device virtual address.
> 
> After this, relocation information is processed. Each reference to
> target buffers in command stream are replaced with device virtual
> addresses. The relocation information contains the reference to target
> buffer, and to command stream to be able to do this.

Could you explain more about this "relocation information"? I assume the
"target buffers" mentioned here are memory holding things like textures, or
compressed video data that needs to be decoded...
But userspace should already have allocated the memory to hold them, so why do
we need to relocate?

> 
> After relocation, each wait is checked against expiration. Any wait
> whose threshold has already expired will be converted to a no-wait by
> writing `0x` over the word. This will essentially turn any
> expired wait into a wait for sync point register 0, value 0, and thus we
> keep sync point 0 reserved for this purpose and never change it from
> value 0.
> 
> In upstream kernel without IOMMU support we also check the contents of
> the command stream for any accesses to memory that are not taken care of
> by relocation information.
> 
[...]


First version of host1x intro

2012-12-06 Thread Terje Bergström
On 06.12.2012 09:06, Mark Zhang wrote:
> Thank you for the doc. So here I have questions:
> 
> Push buffer contains a lot of opcodes for this channel. So when multiple
> userspace processes submit jobs to this channel, all these jobs will be
> saved in the push buffer and return, right? I mean, nvhost driver will
> create a workqueue or something to pull stuffs out from the push buffer
> and process them one by one?

Yes, "sync queue" contains the list of jobs that are pending (or that
kernel thinks are pending).

Push buffer in general case contains GATHER opcodes, which point to the
streams from user space. This way we don't have to copy command streams.
In case IOMMU is not available, we either have to copy the contents
directly to push buffer so user space can't modify it later (I'm trying
to implement this), or then we have to ensure the command streams cannot
be tampered with in some other way.

> Besides, "If command DMA sees opcode GATHER, it will allocate(you missed
> a verb here and I suppose it may be "allocate") a memory area to command
> FIFO" -- So why we need command FIFO, this extra component? Can't we
> just pass the correct address in push buffer to host1x clients?

This is about the hardware, and the correct verb is "copy". HOST1X
hardware pre-fetches opcodes from push buffer and contents of GATHERs to
a FIFO to overcome memory latencies. The execution happens from FIFO.

host1x clients don't know about push buffers. They're a feature of
HOST1X. HOST1X just interprets the opcodes and performs the operations
indicated, for example writes a value to a register of 2D.

In general, the FIFO is invisible to users of HOST1X, but important to
understand when debugging stuck hardware.
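
To put the push buffer and GATHER part into code form, a very rough sketch
with a made-up opcode encoding and placeholder types, not the actual HOST1X
programming interface:

```c
#include <stdint.h>

/* Made-up encoding: "fetch N words"; the real opcode layout is HW-specific. */
#define OPCODE_GATHER(words)  (0x60000000u | ((words) & 0x3fffu))

struct push_buffer {
	uint32_t *cpu;   /* kernel mapping of the push buffer */
	uint32_t  pos;   /* write position, in words */
};

/* Kernel side: reference a user command stream without copying it. */
static void push_gather(struct push_buffer *pb,
			uint32_t stream_iova, uint32_t num_words)
{
	pb->cpu[pb->pos++] = OPCODE_GATHER(num_words); /* word count in the opcode  */
	pb->cpu[pb->pos++] = stream_iova;              /* base address in next word */

	/* Advancing HOST1X_CHANNEL_DMAPUT_0 past these words tells command DMA
	 * there is new work; it prefetches from the push buffer and from the
	 * GATHER target into its FIFO and executes from there. */
}
```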

> And when the host1x client starts working is controlled by userspace
> program, right? Because command DMA allocates the command FIFO when it
> sees opcode "GATHER". Or nvhost driver will generate "GATHER" as well,
> to buffer some opcodes then send them to host1x clients in one shot?

FIFO is hardware inside HOST1X, so it's not allocated by user space or
kernel.

> Could you explain more about this "relocation information"? I assume the
> "target buffers" here mentioned are some memory saving, e.g, textures,
> compressed video data which need to be decoded...
> But the userspace should already allocate the memory to save them, why
> we need to relocate?

Lucas already did a better job of explaining than I could've, so I'll
pass. :-)

I'll add some notes about these to the doc.

Terje


First version of host1x intro

2012-12-06 Thread Lucas Stach
Am Donnerstag, den 06.12.2012, 16:13 +0800 schrieb Mark Zhang:
> On 12/06/2012 04:00 PM, Lucas Stach wrote:
[...]
> > 
> > Or maybe I'm misunderstanding you and you mean it the other way around.
> > You don't let userspace dictate the addresses, the relocation
> > information just tells the kernel to find the addresses of the
> > referenced buffers for you and insert them, instead of the dummy
> > information, into the command stream.
> 
> Yes, I think this is what I mean. No dummy information in the command
> stream, userspace just fills the address which it uses(actually this is
> cpu address of the buffer) in the command stream, and our driver must
> have a HashTable or something which contains the buffer address pair --
> (cpu address, dma address), so our driver can find the dma addresses for
> every buffer then modify the addresses in the command stream.
> 
> Hope I explain that clear.
> 

And to do so we would have to hold an unfortunately large table in the kernel,
as a buffer can be mapped by different userspace processes at different
locations. You would also have to match against variably sized ranges to find
the correct buffer, as userspace would have to pack the buffer base address and
the offset into the same value.

I really don't think it's worth the hassle; the right way is to use the proven
scheme of a side-band buffer to pass in the reloc information.

Regards,
Lucas



First version of host1x intro

2012-12-06 Thread Stephen Warren
On 12/06/2012 01:13 AM, Mark Zhang wrote:
> On 12/06/2012 04:00 PM, Lucas Stach wrote:
>> Am Donnerstag, den 06.12.2012, 15:49 +0800 schrieb Mark Zhang:
> [...]
>>>
>>> OK. So these relocation addresses are used to let userspace tells kernel
>>> which buffers mentioned in the command should be relocated to addresses
>>> which host1x clients able to reach.
>>>
>> Yep, preferably all buffers referenced by a command stream should
>> already be set up in such a position (CMA with Tegra2) or the relocation
>> should be nothing more than setting up IOMMU page tables (Tegra3).
>>
>>> I'm also wondering that, if our driver understands the stuffs in the
>>> commands, maybe we can find out all addresses in the command, in that
>>> way, we will not need userspace tells us which are the addresses need to
>>> be relocated, right?
>>
>> No. How will the kernel ever know which buffer gets referenced in a
>> command stream? All the kernel sees is is a command stream with
>> something like "blit data to address 0xADDR" in it. The only info that
>> you can gather from that is that there must be some buffer to blit into.
>> Neither do you know which buffer the stuff should be going to, nor can
>> you know if you blit to offset zero in this buffer. It's perfectly valid
>> to only use a subregion of a buffer.
>>
>> Or maybe I'm misunderstanding you and you mean it the other way around.
>> You don't let userspace dictate the addresses, the relocation
>> information just tells the kernel to find the addresses of the
>> referenced buffers for you and insert them, instead of the dummy
>> information, into the command stream.
> 
> Yes, I think this is what I mean. No dummy information in the command
> stream, userspace just fills the address which it uses(actually this is
> cpu address of the buffer) in the command stream, and our driver must
> have a HashTable or something which contains the buffer address pair --
> (cpu address, dma address), so our driver can find the dma addresses for
> every buffer then modify the addresses in the command stream.

Typically there would be no CPU address; there's no need in most cases
to ever map most buffers to the CPU.

Automatically parsing the buffer sounds like an interesting idea, but
then the kernel relocation code would have to know the meaning of every
single register or command-stream "method" in order to know which of
them take a buffer address as an argument. I am not familiar with this
HW specifically, so perhaps it's much more regular than I think and it's
actually easy to do that, but I imagine it'd be a PITA to implement that
(although perhaps we have to for the command-stream validation stuff
anyway?). Also, it'd be a lot quicker at least for larger command
buffers to go straight to the few locations in the command stream where
a buffer is referenced (i.e. use the side-band metadata for relocation)
rather than having the CPU re-read the entire command stream in the
kernel to parse it.
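
To make the comparison concrete, a purely hypothetical sketch (the method
numbers, encoding and table are invented, not real Tegra definitions) of what
in-kernel parsing would require: per-method knowledge for every client,
maintained for every chip generation.

```c
#include <stdint.h>
#include <stddef.h>

/* Invented per-method descriptor; real methods/encodings differ per client. */
struct method_info {
	uint16_t method;        /* method/register number */
	uint8_t  takes_address; /* does its argument reference a buffer? */
};

/* One table like this per client unit, kept in sync for each Tegra generation. */
static const struct method_info gr2d_methods[] = {
	{ 0x01 /* invented */, 0 },
	{ 0x2b /* invented */, 1 },
	/* ...and every other method of every client... */
};

/* Scan the whole stream just to find the few words that hold addresses. */
static size_t find_address_words(const uint32_t *stream, size_t len,
				 size_t *out, size_t max_out)
{
	size_t found = 0;

	for (size_t i = 0; i + 1 < len; i += 2) {      /* assume method,arg pairs */
		uint16_t method = stream[i] & 0xffff;   /* invented encoding */

		for (size_t m = 0; m < sizeof(gr2d_methods) / sizeof(gr2d_methods[0]); m++)
			if (gr2d_methods[m].method == method &&
			    gr2d_methods[m].takes_address && found < max_out)
				out[found++] = i + 1;
	}
	return found;
}

/* With side-band relocation the kernel instead jumps straight to the word
 * offsets userspace listed, without reading the rest of the stream. */
```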


First version of host1x intro

2012-12-06 Thread Lucas Stach
Am Donnerstag, den 06.12.2012, 15:49 +0800 schrieb Mark Zhang:
> On 12/06/2012 03:20 PM, Lucas Stach wrote:
> > Am Donnerstag, den 06.12.2012, 15:06 +0800 schrieb Mark Zhang:
> > [...]
> >>> First action taken is taking a reference to all buffers in the command
> >>> stream. This includes the command stream buffers themselves, but also
> >>> the target buffers. We also map each buffer to target hardware to get a
> >>> device virtual address.
> >>>
> >>> After this, relocation information is processed. Each reference to
> >>> target buffers in command stream are replaced with device virtual
> >>> addresses. The relocation information contains the reference to target
> >>> buffer, and to command stream to be able to do this.
> >>
> >> Could you explain more about this "relocation information"? I assume the
> >> "target buffers" here mentioned are some memory saving, e.g, textures,
> >> compressed video data which need to be decoded...
> >> But the userspace should already allocate the memory to save them, why
> >> we need to relocate?
> >>
> > "Relocation" is the term used to express the fixup of addresses in the
> > command buffer. You are right, the memory is allocated and stays the
> > same, but userspace can not know where in the GPU address space a
> > specific buffer is bound (maybe it's even unbound at the time, when
> > userspace stitches together the pushbuf). So userspace dumps some kind
> > of dummy information into the command stream instead of a real buffer
> > address. With the relocation information (which is kind of a sideband
> > buffer to the commandbuf) it then tells the kernel to insert real GPU
> > virtual addresses in the locations of the dummy info. For the kernel to
> > do so, it needs to know:
> > 1. where in the command stream is a dummy address
> > 2. which buffers address should be inserted instead
> > 3. which offset into this buffer should be added to the address to be
> > inserted
> > 
> 
> OK. So these relocation addresses are used to let userspace tells kernel
> which buffers mentioned in the command should be relocated to addresses
> which host1x clients able to reach.
> 
Yep, preferably all buffers referenced by a command stream should
already be set up in such a position (CMA with Tegra2) or the relocation
should be nothing more than setting up IOMMU page tables (Tegra3).

> I'm also wondering that, if our driver understands the stuffs in the
> commands, maybe we can find out all addresses in the command, in that
> way, we will not need userspace tells us which are the addresses need to
> be relocated, right?

No. How will the kernel ever know which buffer gets referenced in a
command stream? All the kernel sees is a command stream with
something like "blit data to address 0xADDR" in it. The only info that
you can gather from that is that there must be some buffer to blit into.
Neither do you know which buffer the stuff should be going to, nor can
you know if you blit to offset zero in this buffer. It's perfectly valid
to only use a subregion of a buffer.

Or maybe I'm misunderstanding you and you mean it the other way around.
You don't let userspace dictate the addresses, the relocation
information just tells the kernel to find the addresses of the
referenced buffers for you and insert them, instead of the dummy
information, into the command stream.

Regards,
Lucas




First version of host1x intro

2012-12-06 Thread Lucas Stach
Am Donnerstag, den 06.12.2012, 15:06 +0800 schrieb Mark Zhang:
[...]
> > First action taken is taking a reference to all buffers in the command
> > stream. This includes the command stream buffers themselves, but also
> > the target buffers. We also map each buffer to target hardware to get a
> > device virtual address.
> > 
> > After this, relocation information is processed. Each reference to
> > target buffers in command stream are replaced with device virtual
> > addresses. The relocation information contains the reference to target
> > buffer, and to command stream to be able to do this.
> 
> Could you explain more about this "relocation information"? I assume the
> "target buffers" here mentioned are some memory saving, e.g, textures,
> compressed video data which need to be decoded...
> But the userspace should already allocate the memory to save them, why
> we need to relocate?
> 
"Relocation" is the term used to express the fixup of addresses in the
command buffer. You are right, the memory is allocated and stays the
same, but userspace can not know where in the GPU address space a
specific buffer is bound (maybe it's even unbound at the time, when
userspace stitches together the pushbuf). So userspace dumps some kind
of dummy information into the command stream instead of a real buffer
address. With the relocation information (which is kind of a sideband
buffer to the commandbuf) it then tells the kernel to insert real GPU
virtual addresses in the locations of the dummy info. For the kernel to
do so, it needs to know:
1. where in the command stream is a dummy address
2. which buffer's address should be inserted instead
3. which offset into this buffer should be added to the address to be
inserted

So while processing a reloc, the kernel pins the buffers in memory (makes the
pages non-movable and binds them into the GPU address space) and substitutes
all the dummy information with real GPU virtual addresses in the commandbuf.
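
In code form, a minimal sketch of what that boils down to; the names are
placeholders, not the real tegradrm/nvhost structures:

```c
#include <stdint.h>
#include <stddef.h>

/* The three pieces of information listed above, per relocation. */
struct reloc {
	size_t     cmdbuf_word;    /* 1. where in the stream the dummy word sits */
	struct bo *target;         /* 2. which buffer's address to insert */
	uint32_t   target_offset;  /* 3. offset into that buffer */
};

/* Hypothetical: pin the buffer's pages and return its device address. */
uint32_t bo_pin(struct bo *bo);

static void process_relocs(uint32_t *cmdbuf,
			   const struct reloc *relocs, size_t num_relocs)
{
	for (size_t i = 0; i < num_relocs; i++) {
		/* Pin: pages become unmovable and get a GPU/IOMMU mapping. */
		uint32_t iova = bo_pin(relocs[i].target);

		/* Replace the dummy word with the real device virtual address. */
		cmdbuf[relocs[i].cmdbuf_word] = iova + relocs[i].target_offset;
	}
}
```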

Regards,
Lucas



Re: First version of host1x intro

2012-12-06 Thread Terje Bergström
On 06.12.2012 09:06, Mark Zhang wrote:
> Thank you for the doc. So here I have questions:
> 
> Push buffer contains a lot of opcodes for this channel. So when multiple
> userspace processes submit jobs to this channel, all these jobs will be
> saved in the push buffer and return, right? I mean, nvhost driver will
> create a workqueue or something to pull stuffs out from the push buffer
> and process them one by one?

Yes, "sync queue" contains the list of jobs that are pending (or that
kernel thinks are pending).

Push buffer in general case contains GATHER opcodes, which point to the
streams from user space. This way we don't have to copy command streams.
In case IOMMU is not available, we either have to copy the contents
directly to push buffer so user space can't modify it later (I'm trying
to implement this), or then we have to ensure the command streams cannot
be tampered with in some other way.

> Besides, "If command DMA sees opcode GATHER, it will allocate(you missed
> a verb here and I suppose it may be "allocate") a memory area to command
> FIFO" -- So why we need command FIFO, this extra component? Can't we
> just pass the correct address in push buffer to host1x clients?

This is about the hardware, and the correct verb is "copy". HOST1X
hardware pre-fetches opcodes from push buffer and contents of GATHERs to
a FIFO to overcome memory latencies. The execution happens from FIFO.

host1x clients don't know about push buffers. They're a feature of
HOST1X. HOST1X just interprets the opcodes and performs the operations
indicated, for example writes a value to a register of 2D.

In general, the FIFO is invisible to users of HOST1X, but important to
understand when debugging stuck hardware.

> And when the host1x client starts working is controlled by userspace
> program, right? Because command DMA allocates the command FIFO when it
> sees opcode "GATHER". Or nvhost driver will generate "GATHER" as well,
> to buffer some opcodes then send them to host1x clients in one shot?

FIFO is hardware inside HOST1X, so it's not allocated by user space or
kernel.

> Could you explain more about this "relocation information"? I assume the
> "target buffers" here mentioned are some memory saving, e.g, textures,
> compressed video data which need to be decoded...
> But the userspace should already allocate the memory to save them, why
> we need to relocate?

Lucas already did a better job of explaining than I could've, so I'll
pass. :-)

I'll add some notes about these to the doc.

Terje
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: First version of host1x intro

2012-12-06 Thread Lucas Stach
Am Donnerstag, den 06.12.2012, 16:13 +0800 schrieb Mark Zhang:
> On 12/06/2012 04:00 PM, Lucas Stach wrote:
[...]
> > 
> > Or maybe I'm misunderstanding you and you mean it the other way around.
> > You don't let userspace dictate the addresses, the relocation
> > information just tells the kernel to find the addresses of the
> > referenced buffers for you and insert them, instead of the dummy
> > information, into the command stream.
> 
> Yes, I think this is what I mean. No dummy information in the command
> stream, userspace just fills the address which it uses(actually this is
> cpu address of the buffer) in the command stream, and our driver must
> have a HashTable or something which contains the buffer address pair --
> (cpu address, dma address), so our driver can find the dma addresses for
> every buffer then modify the addresses in the command stream.
> 
> Hope I explain that clear.
> 

And to do so we would have to hold an unfortunately large table in
kernel, as a buffer can be mapped by different userspace processes at
different locations. Also you would have to match against some variably
sized ranges to find the correct buffer, as the userspace would have to
pack bufferbaseaddress and offset into same value.

I really don't think it's worth the hassle and it's the right way to use
the proven scheme of a sidebandbuffer to pass in the reloc informations.

Regards,
Lucas

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: First version of host1x intro

2012-12-06 Thread Mark Zhang
On 12/06/2012 04:00 PM, Lucas Stach wrote:
> Am Donnerstag, den 06.12.2012, 15:49 +0800 schrieb Mark Zhang:
[...]
>>
>> OK. So these relocation addresses are used to let userspace tells kernel
>> which buffers mentioned in the command should be relocated to addresses
>> which host1x clients able to reach.
>>
> Yep, preferably all buffers referenced by a command stream should
> already be set up in such a position (CMA with Tegra2) or the relocation
> should be nothing more than setting up IOMMU page tables (Tegra3).
> 
>> I'm also wondering that, if our driver understands the stuffs in the
>> commands, maybe we can find out all addresses in the command, in that
>> way, we will not need userspace tells us which are the addresses need to
>> be relocated, right?
> 
> No. How will the kernel ever know which buffer gets referenced in a
> command stream? All the kernel sees is is a command stream with
> something like "blit data to address 0xADDR" in it. The only info that
> you can gather from that is that there must be some buffer to blit into.
> Neither do you know which buffer the stuff should be going to, nor can
> you know if you blit to offset zero in this buffer. It's perfectly valid
> to only use a subregion of a buffer.
> 
> Or maybe I'm misunderstanding you and you mean it the other way around.
> You don't let userspace dictate the addresses, the relocation
> information just tells the kernel to find the addresses of the
> referenced buffers for you and insert them, instead of the dummy
> information, into the command stream.

Yes, I think this is what I mean. No dummy information in the command
stream, userspace just fills the address which it uses(actually this is
cpu address of the buffer) in the command stream, and our driver must
have a HashTable or something which contains the buffer address pair --
(cpu address, dma address), so our driver can find the dma addresses for
every buffer then modify the addresses in the command stream.

Hope I explain that clear.

> 
> Regards,
> Lucas
> 
> 
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: First version of host1x intro

2012-12-06 Thread Lucas Stach
Am Donnerstag, den 06.12.2012, 15:49 +0800 schrieb Mark Zhang:
> On 12/06/2012 03:20 PM, Lucas Stach wrote:
> > Am Donnerstag, den 06.12.2012, 15:06 +0800 schrieb Mark Zhang:
> > [...]
> >>> First action taken is taking a reference to all buffers in the command
> >>> stream. This includes the command stream buffers themselves, but also
> >>> the target buffers. We also map each buffer to target hardware to get a
> >>> device virtual address.
> >>>
> >>> After this, relocation information is processed. Each reference to
> >>> target buffers in command stream are replaced with device virtual
> >>> addresses. The relocation information contains the reference to target
> >>> buffer, and to command stream to be able to do this.
> >>
> >> Could you explain more about this "relocation information"? I assume the
> >> "target buffers" here mentioned are some memory saving, e.g, textures,
> >> compressed video data which need to be decoded...
> >> But the userspace should already allocate the memory to save them, why
> >> we need to relocate?
> >>
> > "Relocation" is the term used to express the fixup of addresses in the
> > command buffer. You are right, the memory is allocated and stays the
> > same, but userspace can not know where in the GPU address space a
> > specific buffer is bound (maybe it's even unbound at the time, when
> > userspace stitches together the pushbuf). So userspace dumps some kind
> > of dummy information into the command stream instead of a real buffer
> > address. With the relocation information (which is kind of a sideband
> > buffer to the commandbuf) it then tells the kernel to insert real GPU
> > virtual addresses in the locations of the dummy info. For the kernel to
> > do so, it needs to know:
> > 1. where in the command stream is a dummy address
> > 2. which buffers address should be inserted instead
> > 3. which offset into this buffer should be added to the address to be
> > inserted
> > 
> 
> OK. So these relocation addresses are used to let userspace tells kernel
> which buffers mentioned in the command should be relocated to addresses
> which host1x clients able to reach.
> 
Yep, preferably all buffers referenced by a command stream should
already be set up in such a position (CMA with Tegra2) or the relocation
should be nothing more than setting up IOMMU page tables (Tegra3).

> I'm also wondering that, if our driver understands the stuffs in the
> commands, maybe we can find out all addresses in the command, in that
> way, we will not need userspace tells us which are the addresses need to
> be relocated, right?

No. How will the kernel ever know which buffer gets referenced in a
command stream? All the kernel sees is is a command stream with
something like "blit data to address 0xADDR" in it. The only info that
you can gather from that is that there must be some buffer to blit into.
Neither do you know which buffer the stuff should be going to, nor can
you know if you blit to offset zero in this buffer. It's perfectly valid
to only use a subregion of a buffer.

Or maybe I'm misunderstanding you and you mean it the other way around.
You don't let userspace dictate the addresses, the relocation
information just tells the kernel to find the addresses of the
referenced buffers for you and insert them, instead of the dummy
information, into the command stream.

Regards,
Lucas


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: First version of host1x intro

2012-12-06 Thread Mark Zhang
On 12/06/2012 03:20 PM, Lucas Stach wrote:
> Am Donnerstag, den 06.12.2012, 15:06 +0800 schrieb Mark Zhang:
> [...]
>>> First action taken is taking a reference to all buffers in the command
>>> stream. This includes the command stream buffers themselves, but also
>>> the target buffers. We also map each buffer to target hardware to get a
>>> device virtual address.
>>>
>>> After this, relocation information is processed. Each reference to
>>> target buffers in command stream are replaced with device virtual
>>> addresses. The relocation information contains the reference to target
>>> buffer, and to command stream to be able to do this.
>>
>> Could you explain more about this "relocation information"? I assume the
>> "target buffers" here mentioned are some memory saving, e.g, textures,
>> compressed video data which need to be decoded...
>> But the userspace should already allocate the memory to save them, why
>> we need to relocate?
>>
> "Relocation" is the term used to express the fixup of addresses in the
> command buffer. You are right, the memory is allocated and stays the
> same, but userspace can not know where in the GPU address space a
> specific buffer is bound (maybe it's even unbound at the time, when
> userspace stitches together the pushbuf). So userspace dumps some kind
> of dummy information into the command stream instead of a real buffer
> address. With the relocation information (which is kind of a sideband
> buffer to the commandbuf) it then tells the kernel to insert real GPU
> virtual addresses in the locations of the dummy info. For the kernel to
> do so, it needs to know:
> 1. where in the command stream is a dummy address
> 2. which buffers address should be inserted instead
> 3. which offset into this buffer should be added to the address to be
> inserted
> 

OK. So these relocation addresses are used to let userspace tells kernel
which buffers mentioned in the command should be relocated to addresses
which host1x clients able to reach.

I'm also wondering that, if our driver understands the stuffs in the
commands, maybe we can find out all addresses in the command, in that
way, we will not need userspace tells us which are the addresses need to
be relocated, right?

Mark
> So while processing a reloc, kernel pins buffers in memory (makes pages
> non-movable and bind them into gpu address space) and substitute all
> dummy information with real gpu virt addresses in the commandbuf.
> 
> Regards,
> Lucas
> 


Re: First version of host1x intro

2012-12-06 Thread Lucas Stach
Am Donnerstag, den 06.12.2012, 15:06 +0800 schrieb Mark Zhang:
[...]
> > The first action taken is taking a reference to all buffers in the
> > command stream. This includes the command stream buffers themselves,
> > but also the target buffers. We also map each buffer to the target
> > hardware to get a device virtual address.
> >
> > After this, the relocation information is processed. All references
> > to target buffers in the command stream are replaced with device
> > virtual addresses. The relocation information contains references to
> > both the target buffer and the command stream to be able to do this.
> 
> Could you explain more about this "relocation information"? I assume
> the "target buffers" mentioned here are memory holding data, e.g.
> textures or compressed video data that needs to be decoded...
> But userspace should already have allocated the memory to hold them,
> so why do we need to relocate?
> 
"Relocation" is the term used to express the fixup of addresses in the
command buffer. You are right, the memory is allocated and stays the
same, but userspace can not know where in the GPU address space a
specific buffer is bound (maybe it's even unbound at the time, when
userspace stitches together the pushbuf). So userspace dumps some kind
of dummy information into the command stream instead of a real buffer
address. With the relocation information (which is kind of a sideband
buffer to the commandbuf) it then tells the kernel to insert real GPU
virtual addresses in the locations of the dummy info. For the kernel to
do so, it needs to know:
1. where in the command stream is a dummy address
2. which buffers address should be inserted instead
3. which offset into this buffer should be added to the address to be
inserted

So while processing a reloc, the kernel pins the buffers in memory
(makes the pages non-movable and binds them into the GPU address space)
and substitutes all dummy information in the commandbuf with real GPU
virtual addresses.
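
As a rough illustration, here is a minimal sketch of what such a
sideband relocation entry and the kernel-side patching could look like.
The struct layout, field names and the pin_and_map_buffer() helper are
hypothetical, not the actual tegradrm/host1x interface; kernel-style
types (u32, dma_addr_t) are assumed.

    /* Hypothetical sideband relocation entry supplied by userspace. */
    struct reloc_entry {
        u32 cmdbuf_offset;   /* 1. word offset of the dummy address      */
        u32 target_handle;   /* 2. buffer whose address must be inserted */
        u32 target_offset;   /* 3. offset added to the buffer address    */
    };

    /* Sketch of processing one reloc while handling a submit. */
    static int process_reloc(u32 *cmdbuf, const struct reloc_entry *r)
    {
        /* Pinning makes the pages non-movable and binds them into the
         * device address space, yielding a device virtual address.
         * pin_and_map_buffer() is a placeholder for that step. */
        dma_addr_t base = pin_and_map_buffer(r->target_handle);

        if (!base)
            return -EINVAL;

        /* Replace the dummy word with the real device virtual address. */
        cmdbuf[r->cmdbuf_offset] = (u32)(base + r->target_offset);
        return 0;
    }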

Regards,
Lucas



Re: First version of host1x intro

2012-12-06 Thread Mark Zhang
On 12/05/2012 05:47 PM, Terje Bergström wrote:
> Hi,
> 
[...]
> 
> Channels
> 
> 
> Channel is a push buffer containing HOST1X opcodes. The push buffer
> boundaries are defined with `HOST1X_CHANNEL_DMASTART_0` and
> `HOST1X_CHANNEL_DMAEND_0`. `HOST1X_CHANNEL_DMAGET_0` indicates the next
> position within the boundaries that is going to be processes, and
> `HOST1X_CHANNEL_DMAPUT_0` indicates the position of last valid opcode.
> Whenever `HOST1X_CHANNEL_DMAPUT_0` and `HOST1X_CHANNEL_DMAGET_0` differ,
> command DMA will copy commands from push buffer to a command FIFO.
> 
> If command DMA sees opcode GATHER, it will a memory area to command
> FIFO. The number of words is indicated in GATHER opcode, and the base
> address is read from the following word. GATHERs are not recursive.
> 

Thank you for the doc. I have some questions:

The push buffer contains a lot of opcodes for this channel. So when
multiple userspace processes submit jobs to this channel, all these jobs
will be saved in the push buffer and the calls return, right? I mean,
the nvhost driver will create a workqueue or something to pull stuff out
of the push buffer and process it one by one?

Besides, "If command DMA sees opcode GATHER, it will allocate (you
missed a verb here and I suppose it may be "allocate") a memory area to
command FIFO" -- so why do we need the command FIFO, this extra
component? Can't we just pass the correct addresses in the push buffer
to the host1x clients?

And the point at which the host1x client starts working is controlled by
the userspace program, right? Because command DMA allocates the command
FIFO when it sees the opcode "GATHER". Or will the nvhost driver
generate "GATHER" as well, to buffer some opcodes and then send them to
the host1x clients in one shot?

> HOST1X command processor goes through the FIFO and executes opcodes.
> Each channel has some stored state, such as the client unit this channel
> is talking to. The most important opcodes are:
>
[...]
> The first action taken is taking a reference to all buffers in the
> command stream. This includes the command stream buffers themselves,
> but also the target buffers. We also map each buffer to the target
> hardware to get a device virtual address.
>
> After this, the relocation information is processed. All references to
> target buffers in the command stream are replaced with device virtual
> addresses. The relocation information contains references to both the
> target buffer and the command stream to be able to do this.

Could you explain more about this "relocation information"? I assume
the "target buffers" mentioned here are memory holding data, e.g.
textures or compressed video data that needs to be decoded...
But userspace should already have allocated the memory to hold them, so
why do we need to relocate?

> 
> After relocation, each wait is checked against expiration. Any wait
> whose threshold has already expired will be converted to a no-wait by
> writing `0x` over the word. This will essentially turn any
> expired wait into a wait for sync point register 0, value 0, and thus we
> keep sync point 0 reserved for this purpose and never change it from
> value 0.
> 
> In upstream kernel without IOMMU support we also check the contents of
> the command stream for any accesses to memory that are not taken care of
> by relocation information.
> 
[...]


First version of host1x intro

2012-12-05 Thread Terje Bergström
Hi,

I created a base for host1x introduction text, and pasted it into
https://gitorious.org/linux-tegra-drm/pages/Host1xIntroduction. For
convenience, I also copy it below.

As I've worked with all of this for so long, I cannot know what areas
are most interesting to you, so I just tried to put in the basics and
scope it to the features we've been discussing so far. Please point out
the features that you'd like more information on so I can add details.

2D is still totally missing from here. Everything is treated as generic
host1x clients. libdrm is touched on only briefly.

The text is written in LaTeX, and converted with pandoc. I beg for
forgiveness for any formatting oddities.

Hardware introduction
=

HOST1X is a front-end to a set of client units that deal with graphics
and multimedia. The most important features are channels for serializing
and offloading programming of the client units, and sync points for
synchronizing client units with each other or with the CPU.

Channels
--------

A channel is a push buffer containing HOST1X opcodes. The push buffer
boundaries are defined with `HOST1X_CHANNEL_DMASTART_0` and
`HOST1X_CHANNEL_DMAEND_0`. `HOST1X_CHANNEL_DMAGET_0` indicates the next
position within the boundaries that is going to be processed, and
`HOST1X_CHANNEL_DMAPUT_0` indicates the position of the last valid
opcode. Whenever `HOST1X_CHANNEL_DMAPUT_0` and `HOST1X_CHANNEL_DMAGET_0`
differ, command DMA will copy commands from the push buffer to a command
FIFO.
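
For illustration, a minimal sketch of how a driver could append opcodes
and kick command DMA by advancing DMAPUT. The struct layout, the use of
`HOST1X_CHANNEL_DMAPUT_0` as a byte offset and the simplified wrap
handling are assumptions, not the actual nvhost code.

    struct channel {
        void __iomem *regs;          /* channel register aperture      */
        u32 *pushbuf_cpu;            /* CPU mapping of the push buffer */
        dma_addr_t pushbuf_iova;     /* device address (== DMASTART)   */
        unsigned int num_words;      /* push buffer size in words      */
        unsigned int put;            /* index of the next free word    */
    };

    static void channel_push(struct channel *ch, const u32 *words,
                             unsigned int count)
    {
        unsigned int i;

        for (i = 0; i < count; i++) {
            ch->pushbuf_cpu[ch->put] = words[i];
            ch->put = (ch->put + 1) % ch->num_words; /* simplified wrap */
        }

        wmb(); /* opcodes must be visible before DMAPUT moves */

        /* Command DMA copies everything between DMAGET and DMAPUT
         * into the command FIFO. */
        writel(ch->pushbuf_iova + ch->put * sizeof(u32),
               ch->regs + HOST1X_CHANNEL_DMAPUT_0);
    }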

If command DMA sees opcode GATHER, it will a memory area to command
FIFO. The number of words is indicated in GATHER opcode, and the base
address is read from the following word. GATHERs are not recursive.

HOST1X command processor goes through the FIFO and executes opcodes.
Each channel has some stored state, such as the client unit this channel
is talking to. The most important opcodes are:

-   SETCL for changing the target client unit

-   IMM, INCR, NONINCR, MASK write values to registers of client unit

-   GATHER instructs command DMA to fetch from another memory area

-   RESTART instructs command DMA to start over from beginning of push
buffer

The channel class can also be HOST1X itself. Register writes to HOST1X
will invoke host class methods. The most important use is
`NV_CLASS_HOST_WAIT_SYNCPT_0`, which freezes a channel until a sync
point reaches a threshold value.
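
To make the opcode list above more concrete, here is a hedged sketch of
how the opcode words could be encoded. The layout mirrors the one used
by the later upstream host1x driver, with the opcode in the top four
bits; treat the exact bit positions as assumptions for illustration.

    static inline u32 op_setcl(unsigned offset, unsigned classid,
                               unsigned mask)
    {
        return (0u << 28) | (offset << 16) | (classid << 6) | mask;
    }

    static inline u32 op_incr(unsigned offset, unsigned count)
    {
        return (1u << 28) | (offset << 16) | count; /* offset increments */
    }

    static inline u32 op_nonincr(unsigned offset, unsigned count)
    {
        return (2u << 28) | (offset << 16) | count; /* same offset       */
    }

    static inline u32 op_imm(unsigned offset, unsigned value)
    {
        return (4u << 28) | (offset << 16) | value; /* 16-bit immediate  */
    }

    static inline u32 op_gather(unsigned count)
    {
        return (6u << 28) | count; /* base address follows in next word */
    }

A stream that selects a client class and writes two consecutive
registers would then simply be { op_setcl(0, classid, 0),
op_incr(first_reg, 2), value0, value1 }.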

Synchronization
---

A sync point is a 32-bit register in HOST1X. There are 32 sync points in
Tegra2 and Tegra3. HOST1X can be programmed to assert an interrupt when
a value higher than a pre-determined threshold is written to a sync
point register. Each channel can also be frozen while waiting for a
threshold to be reached.

Sync points are initialized to zero at boot-up and treated as
monotonically incrementing counters with wrapping. The CPU can increment
a sync point by writing the sync point id (0-31 in Tegra2 and Tegra3) to
the register `HOST1X_SYNC_SYNCPT_CPU_INCR_0`. Client units all have a
sync point increment method at offset 0, and command streams use it to
request that client units increment a sync point. The parameters of the
increment method are a condition and a sync point id. The condition can
be `OP_DONE`, telling the unit to increment the sync point when previous
operations are done, or `RD_DONE`, indicating that the client unit has
finished all reads from buffers.
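
Using the opcode helpers sketched above, the two words a command stream
could append to request an `OP_DONE` increment might look as follows.
The numeric value of the condition and the bit layout of the method
argument (condition in the upper bits, sync point id in the lower bits)
are assumptions for illustration only.

    #define COND_OP_DONE 1  /* assumed encoding of the OP_DONE condition */

    /* Append: "increment syncpt_id once previous operations are done".
     * The increment method sits at offset 0 of every client class. */
    static unsigned emit_op_done_incr(u32 *stream, unsigned pos,
                                      unsigned syncpt_id)
    {
        stream[pos++] = op_nonincr(0, 1);
        stream[pos++] = (COND_OP_DONE << 8) | syncpt_id;
        return pos;
    }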

Software
========

There are three components involved in programming HOST1X and its
client units. The Linux kernel contains the drivers tegradrm and host1x.
The user space library libdrm gains added functionality to communicate
with tegradrm, which in turn communicates with the host1x driver.

This text discusses only pieces relevant to HOST1X and its client units,
excluding the part about frame buffer and display controller
programming.

libdrm
==

libdrm communicates with the tegradrm kernel driver to allocate buffers,
create and send command streams, and synchronize.

TODO

tegradrm
========

tegradrm contains functionality to allocate buffers and open channels.
The only channel available at the moment is the 2D channel, which is
handled by the 2D driver inside tegradrm.

Command stream management and synchronization are passed on from the 2D
driver to the host1x driver. The 2D driver inside tegradrm processes the
requests from user space and calls the relevant functions in host1x.

host1x driver
=

At bootup, host1x initializes the hardware: it clears the sync points
and registers interrupt handlers.

Sync points
---

Each sync point register is treated as a range. The range minimum is a
shadow copy of the sync point register, and the maximum tracks how many
increments we expect to be done. A fence is a pair (sync point id,
threshold value) indicating completion of an event of interest to
software.
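
A sketch of the wrap-safe bookkeeping this implies; the structure and
helper names are hypothetical, and the check assumes thresholds stay
within 2^31 increments of the shadow value.

    struct syncpt_range {
        u32 min;  /* shadow copy of the hardware sync point value    */
        u32 max;  /* highest value the driver has promised to reach  */
    };

    /* A fence (sync point id, threshold) has expired once min has
     * passed the threshold, taking 32-bit wrapping into account. */
    static bool fence_expired(const struct syncpt_range *sp, u32 threshold)
    {
        return (s32)(sp->min - threshold) >= 0;
    }

    /* A wait is sensible only if its threshold lies inside ]min, max];
     * anything else is either already expired or unreachable. */
    static bool wait_in_range(const struct syncpt_range *sp, u32 threshold)
    {
        return (u32)(threshold - sp->min - 1) < (u32)(sp->max - sp->min);
    }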

Due to wrapping, software pre-checks each sync point wait, whether it is
done via a HOST1X channel or by the CPU. Each wait is potentially for an
already expired fence. Any wait whose threshold value lies outside the
range ]min, max]
