Re: [ivtv-devel] Trac Ticket 49 help

Chris Kennedy Sun, 18 Jun 2006 15:11:57 -0700

Kirk Lewis wrote:
> You totally lost me :*( Shouldn't the PCI card only be able to write to the
> regions of memory allocated for DMA and allowed by the kernel? From what
> I've read about DMA, the devices can't address just anywhere in system
> memory. Certainly, what I've been seeing doesn't look anything as serious as
> random writes to system memory.
This is at the firmware level. The encoding/decoding chip is a bus-master
device that can both read and write, so it is able to write anywhere in
PCI memory or read from PCI memory, basically transferring data back and
forth by itself.  I don't fully understand how that all works or
everything the guy was talking about; it has to do with how the chip
works alongside the PCI bus, and it sounds like hardware engineer stuff
that most people don't know in detail, definitely not me.

So this is how the memory transfers are handled at a lower level: we
basically fill buffers that can be sent through the card's DMA hardware
units, and then toggle some register values to point to those buffers.
From there the chip takes over and works with the PCI bus directly, and
those two manage the xfer between them, with the firmware doing part of
the work.  This whole process, not only the firmware but also the
hardware units, is where the 'bugs' are.  So it's nothing we can get at;
fixing it would mean redesigning the hardware.  (That is why there is a
cx23418 lingering out there, although it's doubtful anyone will ever use
it, since Conexant redid the chip without fully solving these problems.
The engineers didn't fully understand the cores of the hardware units,
because the original engineers from iCompression all left with big
payoffs from stock options when the company was bought out, and it seems
they didn't document very much at all.  So they just moved the units
onto the new chip in bulk, thinking that at least they would no longer
be at the mercy of the Java hardware CPU bugs and could use ARM
processors instead and write the firmware in C.)  We can only work
around the bugs, and have done so quite well, but it's very doubtful we
can fully solve the inherent hardware limitations (mainly in the
decoder) in how it manages the buffer in SDRAM and the xfers to/from
that buffer.

The DMA errors happen when the xfer fails for some reason, so the PCI
bus and chip are in the middle of the xfer and saying things won't work,
but by that time you're often not able to do much to help it.  The
theory seems to be that these xfers fail because the firmware is
writing/reading SDRAM and registers, overwriting and looping back into
itself, and partly expecting that to work.

Also, one must remember the chip has one single register to control both
decoder and encoder DMA xfers, and it doesn't have any locking method.
So if you do the wrong thing while sending data to/from the chip, you
can easily trigger the opposite transfer type, and that register may
contain a random value and send it to 0x000000 or anywhere else,
including the last used buffer address/hardware address.  (This also
seems hard to control because there are no atomic operations in the
CPUs, so it can happen easily and is not the user's fault.  They
actually started using some other methods of interrupting the decoder's
Java CPU to try to keep the encoder/decoder DMA xfer trigger from doing
this, although it seems to be a bad hack that doesn't always work on all
PCI buses.)

The trick seems to be that when doing this on the PCI bus, that
wrap-around will sometimes work, sometimes randomly corrupt the write,
and sometimes send off an xfer of random data from the chip to some
random address in your PCI memory.  This is actually a well-known
property of wrap-around PCI memory mappings: you can't write there
safely without odd things occurring sometimes.  This chip design leaned
on that bug, since the firmware engineers seemed to take the easy way
out.  It seems the Linux kernel sits in higher memory (or somewhere
specific; I forgot where exactly he said it was), where this overwrite
often sends random data.  So it plays a game of shooting data off into
your memory space, and the effects will vary depending on your specific
hardware and the software running.

So basically the decoder is definitely buggy.  The encoder was designed
alongside it and has similar issues, but nobody ever really attempted to
fix up the decoder design, even in firmware, the way the encoder was
fixed.  The encoder generally works well, which is amazing once you know
all the details of this hardware/firmware combo.  It seems Windows
somehow got really lucky and avoided this problem just through where it
stores the kernel, even though, ironically, the whole chip prototype was
originally done in Linux and aimed at set-top boxes running Linux called
the iVAC.
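
To make the point about the single DMA trigger register a bit more
concrete, here is a rough host-side model in C.  It is only a sketch:
the bit values, the struct and the function names are all made up for
illustration and are not taken from the driver or from any datasheet.
It shows why a plain store to a shared trigger register can wipe out the
other direction's bit, while a read-modify-write that cannot be
interleaved keeps both.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical bit layout: one register starts both transfer types. */
#define DMAXFER_DEC 0x01u  /* assumed: start a decoder (to-card) transfer */
#define DMAXFER_ENC 0x02u  /* assumed: start an encoder (from-card) transfer */

/* Stand-in for the card; a real register lives in memory-mapped I/O. */
struct fake_card {
    atomic_uint dmaxfer;
};

static void card_init(struct fake_card *c)
{
    atomic_init(&c->dmaxfer, 0u);
}

/* Set only the requested bit and leave the other direction untouched.
 * The fetch-or is one indivisible read-modify-write on the host side. */
static uint32_t start_xfer(struct fake_card *c, uint32_t bit)
{
    assert(bit == DMAXFER_DEC || bit == DMAXFER_ENC);
    return (uint32_t)(atomic_fetch_or(&c->dmaxfer, bit) | bit);
}

/* The unsafe variant: a plain store clobbers whatever bit was pending. */
static uint32_t start_xfer_clobber(struct fake_card *c, uint32_t bit)
{
    atomic_store(&c->dmaxfer, bit);
    return (uint32_t)atomic_load(&c->dmaxfer);
}
```

Starting an encoder transfer and then a decoder transfer with
start_xfer() leaves both bits set, whereas start_xfer_clobber() drops
the encoder bit.  Keep in mind this only models the host side; a host
lock or atomic cannot protect against the firmware touching the same
register from inside the chip, which is the part we can't get at.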



I wish I could give all this as a definite, logical, understandable
explanation, but all I have are the fragments I've seen, and from what I
can tell no one knows the reasons fully, since the original engineers
left iCompression when it was sold.  The hardware engineer I know said
it was basically hopeless to make the chip avoid all the bugs, which is
why they were working on the ARM processor version up until Conexant
laid them all off, wanting to cut some spending and look good to the
stockholders :( (plus I think Conexant felt like they were redoing a
failed chip without really redoing it, since the core hardware units
were not being redone, and nowadays SD MPEG-2 encoding is not as cutting
edge as HD or other chips that do MPEG-2 professionally way better).  So
it really boils down to what most have done: use Intel motherboards of
known working status.  Funny that all of the computers they used to test
these chips at the Cupertino, California office where they worked were
Intel-based ;) and they basically said 'yes, these chips are very buggy
on certain PCI buses, it's unavoidable', laughing, since I guess they
went through this same problem for years.

Thanks,
Chris

> However, I do see how a wrap around method
> could corrupt the stream if it were to start overwriting data that hadn't
> been written to the dma buffers yet. Do you have any ideas on seeing if this
> is what is happening? And why would we be falling behind in the first place?
> The system I am using is more than powerful enough to handle this. If this
> is indeed the problem, then there is some other problem that is causing the
> system to lag behind.
>
> On 6/18/06, Chris Kennedy <[EMAIL PROTECTED]> wrote:
>>
>> Kirk Lewis wrote:
>> > Thanks for the information Hans. I see some others are approaching this from
>> > a different angle as well. Hopefully they'll come up with something but I'd
>> > be more comfortable knowing what exactly is causing this and fixing it in
>> > the code. You certainly pointed out several areas of interest. I'll do my
>> > best to investigate them.
>> >
>> > "1) The code that sets up the DMA arrays and registers is buggy. From the
>> > tests done until now it looks like that part is OK, so this cause is
>> > unlikely."
>> >
>> > This seems unlikely to me as well. If nothing changes from the failed try to
>> > the retry (and it shouldn't), then one would think that would rule 1 out.
>> >
>> > "2) Something goes wrong in the queue handling. This was the area where I
>> > wanted to look into next. Did some partial data end up in a buffer? Was
>> > some offset modified?"
>> >
>> > I'll look at this next. It seems like something odd must be being written to
>> > the buffer. I also am going to see what happens if I simply don't try to
>> > redo the DMA.
>> >
>> > "3) It's the firmware and it does indeed require extra handling in case
>> > of a DMA error. Something of a default case if the first two do not pan
>> > out."
>> > I hope it isn't this either, but it seems likely it is a problem with the
>> > DMA error handling.
>> >
>>
>> Interesting. Actually, the problem with DMA errors, at least from the
>> decoder perspective, is that it seems to involve the PCI wrap-around
>> that is done for the chip SDRAM.  I don't fully understand it, but
>> basically it seems the firmware uses this 'feature' and at the same
>> time ends up writing randomly to PCI memory, which causes the random
>> errors on some systems (some lock completely; it depends on the
>> hardware, I guess).  An engineer I knew from Conexant explained it that
>> way to me: the Linux kernel uses the PCI bus and hardware in such a way
>> that this behavior can cause total havoc, and Windows does something
>> different so it isn't seen there as often.  The firmware folks seemed to
>> do things like wrap around and still write to the decoder buffers
>> internally, so they thought this was an easy way to do the wrap-around,
>> but it seems to randomly throw writes and reads into general system
>> memory somehow, and the read/write pointer in the firmware can possibly
>> get mixed up.  That's at least the information I got about it, besides
>> the many Java processor bugs possibly making it less lucky at times.
>> He said that if you try to disable the wrap-around, everything breaks
>> for decoding at least (maybe encoding too), since they wrote the firmware
>> exploiting this wonderful 'feature'.  (This is why the chip has 8 megs
>> of memory instead of 16; they wrapped it, I guess.  At least from what I
>> understand, that's the way it works.)
>>
>> Thanks,
>> Chris
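
One way to picture the wrap-around described above is as address
aliasing: 8 MB of real memory showing up in a larger window, so an
offset past the end lands back at the start.  The C sketch below is only
that mental model.  The 8 MB figure comes from the thread; the 16 MB
window size, the power-of-two mask and the function names are
assumptions for illustration, not documented chip behaviour.

```c
#include <assert.h>
#include <stdint.h>

#define SDRAM_SIZE  (8u * 1024u * 1024u)   /* 8 MB physically present (from the thread) */
#define WINDOW_SIZE (16u * 1024u * 1024u)  /* assumed 16 MB address window */

/* Map an offset in the window onto the physical memory it would hit.
 * With a power-of-two size, masking is the same as taking the modulo. */
static uint32_t sdram_physical_offset(uint32_t window_offset)
{
    assert(window_offset < WINDOW_SIZE);
    return window_offset & (SDRAM_SIZE - 1u);
}

/* Number of bytes of a transfer that run past the end of physical
 * memory and therefore wrap back to offset 0. */
static uint32_t bytes_wrapped(uint32_t start, uint32_t len)
{
    uint32_t room = SDRAM_SIZE - sdram_physical_offset(start);
    return len > room ? len - room : 0u;
}
```

Under this model, offset 0x00800010 aliases to 0x10, and a 32-byte
transfer starting 16 bytes before the 8 MB boundary wraps its last 16
bytes back to the start of memory.  If the firmware's read and write
pointers disagree about where the wrap happened, that is the kind of
overlap that could overwrite data not yet consumed.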
>> >
>> >
>> > On 6/17/06, Hans Verkuil <[EMAIL PROTECTED]> wrote:
>> >>
>> >> On Saturday 17 June 2006 03:46, Kirk Lewis wrote:
>> >> > I would like to help with Trac Ticket 49, as it is currently
>> >> > affecting me. In an earlier thread Hans said he was very busy, so I'd
>> >> > like to help in any way I can.
>> >>
>> >> Great! I'm still working very hard on getting the driver into the kernel
>> >> and it looks like it will take still more time than expected.
>> >>
>> >> > I'm not extremely experienced with
>> >> > drivers, but I know enough to get around. Over the last few days I've
>> >> > been investigating the problem, and I haven't been able to solve it.
>> >> > So, if anyone has any suggestions, please let me know. Here is the
>> >> > information I have uncovered so far:
>> >> >
>> >> > I am running on an intel dual core processor system with a PVR 250
>> >> > (very old) and a new PVR 150. The error observed is:
>> >> >
>> >> > DMA Error 0x0000000b
>> >> >
>> >> > This may, but does not always, cause corruption in the video
>> >> > stream. It may even cause the system to lock (although that could be
>> >> > a result of reading a badly corrupted video stream). I've found it to
>> >> > be very easily reproducible, but only by stressing the system to its
>> >> > limit. To reproduce this bug reliably I have to have 2 transcodings
>> >> > going on while recording on one tuner and watching TV on the other.
>> >> > It will occur in other cases... it just takes a lot longer.
>> >>
>> >> Correct. As far as I could tell the DMA error itself is not a problem.
>> >> It can occur on a heavily loaded system, and the driver should simply
>> >> recover from it gracefully. How often it occurs is very hardware
>> >> dependent: some chipsets have better DMA handling than others, and the
>> >> Conexant MPEG encoder chip is known to have a rather finicky (read
>> >> buggy) DMA engine.
>> >>
>> >> It is the recovery from a DMA error where something goes wrong.
>> >>
>> >> > The area of interest in the code is in dma_from_device in
>> >> > ivtv-irq.c. This is where the error is being printed out from. The
>> >> > error means there was a write error. The write error is occurring
>> >> > exactly as one would expect (if there were to be an error). It occurs
>> >> > after the dma registers are written to instructing a write. It takes
>> >> > a bit for the card to set the DMA error, so the error doesn't occur
>> >> > immediately, it occurs in the while loop which is normally waiting to
>> >> > see the DMA started (the DMAXFER bit to flip).
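
The start, poll, clear and retry pattern described above can be modelled
roughly as follows.  This is a simulation in plain C, not the code in
ivtv-irq.c: the status bits, the sim_card struct and the function names
are all hypothetical, and the "card" is just a struct whose failure
count is set by the caller.

```c
#include <assert.h>
#include <stdint.h>

#define STATUS_DONE 0x01u  /* hypothetical: transfer finished */
#define STATUS_ERR  0x02u  /* hypothetical: transfer failed */

enum dma_result { DMA_OK, DMA_FAILED };

struct sim_card {
    uint32_t status;    /* stand-in for a DMA status register */
    int failures_left;  /* simulation knob: how many attempts will fail */
    int attempts;       /* number of transfers started, for inspection */
};

/* Simulated "write the DMA registers": the card reports the outcome
 * in its status field. */
static void sim_start(struct sim_card *c)
{
    c->attempts++;
    if (c->failures_left > 0) {
        c->failures_left--;
        c->status = STATUS_ERR;
    } else {
        c->status = STATUS_DONE;
    }
}

/* Start a transfer and poll for the result.  On error, clear the
 * status and try again, allowing up to max_retries extra attempts. */
static enum dma_result dma_with_retry(struct sim_card *c, int max_retries)
{
    for (int attempt = 0; attempt <= max_retries; attempt++) {
        sim_start(c);
        /* Bounded poll so a wedged card cannot hang the caller. */
        for (int spin = 0; spin < 1000; spin++) {
            if (c->status & STATUS_ERR)
                break;
            if (c->status & STATUS_DONE)
                return DMA_OK;
        }
        c->status = 0;  /* clear the error bits before retrying */
    }
    return DMA_FAILED;
}
```

With one simulated failure, dma_with_retry() succeeds on the second
attempt, which matches the observation that the retry always goes
through.  What the sketch cannot show is whether the failed attempt
already moved part of the data, which is the open question in this
thread.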
>> >> >
>> >> > Here are some things I have tried:
>> >> >
>> >> > -Uncommenting the DMA_locks placed around dma_from_device elsewhere
>> >> > in ivtv-irq.c. No effect.
>> >> >
>> >> > -Moving the DMA_slock spinlock (Line 640) up to include the loops
>> >> > checking for the appropriate time to start the write. I thought this
>> >> > could be a race condition... but it has no effect. I also did the
>> >> > same in ivtv-queue.c's dma_to_device.
>> >> >
>> >> > -Checking to see if the IVTV_REG_ENCDMAADDR write isn't happening. I
>> >> > noticed there is a place where the code double checks to see if the
>> >> > register was actually written to. In my testing I never saw this
>> >> > register write fail. So... that's not it.
>> >> >
>> >> > -Doing another sanity check to make absolutely sure, at the point the
>> >> > DMA registers are written, that the registers are of the expected
>> >> > value. I never saw anything unusual.
>> >> >
>> >> > -Checking to see if there is some strange pause going on in the code.
>> >> > Even with all my printks I never saw more than 2 jiffies from the
>> >> > start of the method, to the return. This is in the error case (so it
>> >> > was re-trying the DMA write).
>> >> >
>> >> > -I never saw the DMA write fail more than once in a row. It always
>> >> > succeeded (or at least no error is set).
>> >> >
>> >> > -I'm not seeing any ivtv_sleep_timeouts fail.
>> >> >
>> >> > So... I'm running out of ideas. From all appearances, there is
>> >> > nothing different before a successful write and a failed write.
>> >> > Things look the same from the point of view of the registers. The
>> >> > retry always succeeds... even though everything is the same. Every
>> >> > now and then a write appears to randomly fail, and BAM things can get
>> >> > really screwed up.
>> >> >
>> >> > One question for others. Is it normal to see DMA writes fail? Or
>> >> > should it pretty much never happen? If the former is the case then
>> >> > I'll stop trying to figure out what is causing it to fail and focus
>> >> > on trying to find a way to recover from it.
>> >>
>> >> As mentioned above, yes it is normal to see this.
>> >>
>> >> > Is there any strange implicit logic going on that I'm not seeing?
>> >> > Stuff like reading a register changing its state (or some other
>> >> > register), or any of the itv/st data structures being changed? I'm
>> >> > getting really paranoid :*(
>> >> >
>> >> > Is there something different that must be done when a DMA error
>> >> > occurs? Is it not acceptable to just clear the DMA error bits and
>> >> > retry? Are there other bits that must be reset somewhere?
>> >> >
>> >> > Thanks for any advice anyone has.
>> >>
>> >> You've pretty much done the same tests I did (and a few more) and I saw
>> >> the same things.
>> >>
>> >> Now in my opinion there are three possible causes:
>> >>
>> >> 1) The code that sets up the DMA arrays and registers is buggy. From the
>> >> tests done until now it looks like that part is OK, so this cause is
>> >> unlikely.
>> >>
>> >> 2) Something goes wrong in the queue handling. This was the area where I
>> >> wanted to look into next. Did some partial data end up in a buffer? Was
>> >> some offset modified?
>> >>
>> >> 3) It's the firmware and it does indeed require extra handling in case
>> >> of a DMA error. Something of a default case if the first two do not pan
>> >> out.
>> >>
>> >> I'm hoping it is 1 or 2. In that case it is a driver bug and after
>> >> fixing it everyone lives happily ever after. If it is 3, then there are
>> >> three options: first contact Chris Kennedy if he can help. He knows
>> >> more about it than anyone, even though he is no longer active with
>> >> driver development. Alternatively switch to using the mailbox command
>> >> to start the DMA. This was used in the past. For some reason it has
>> >> become linked to the pio setting so turning it on has other side
>> >> effects. Also, AFAIK the reason for abandoning that approach had to do
>> >> with bad behavior of that command when multiple streams are DMAing at
>> >> the same time (e.g. encoder, decoder, OSD).
>> >>
>> >> The third option would be to see if it is possible to discover the
>> >> precise MPEG offset and see if we can compensate for it.
>> >>
>> >> Good luck!
>> >>
>> >>         Hans
>> >>
>> >> _______________________________________________
>> >> ivtv-devel mailing list
>> >> [email protected]
>> >> http://ivtvdriver.org/mailman/listinfo/ivtv-devel
>> >>

