Hi Richard,

That's my theory, though I doubt it's right. But as you say, an easy test is just to delay for a couple more seconds after issuing a sync and see if that helps. If your PPS is a real PPS (rather than just a square wave at some vague 1 s period), then I can't see what difference this would make.

If that doesn't help, my inclination would be to start prodding the 10GbE control signals from software to make sure the reset and software enables are working, and to see whether a tge reset without a new sync behaves differently. But I can't imagine how that would be broken unless the stuff on GitHub is out of date (which I doubt).
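The delay test described above can be sketched as follows. This is only a minimal sketch, not the PAPER init script: the register names (`eth_enable`, `sync_arm`, `eth_reset`) and the `FakeFpga` class are hypothetical stand-ins, and the step order simply follows the four-step start-up sequence quoted further down in this thread.

```python
import time


class FakeFpga:
    """Hypothetical stand-in for a register interface; it only records writes."""

    def __init__(self):
        self.writes = []

    def write_int(self, name, value):
        self.writes.append((name, value))


def start_feng(fpga, extra_delay_s=2.0, sleep=time.sleep):
    """Run the four start-up steps with extra settling time after arming."""
    fpga.write_int("eth_enable", 0)   # 1. disable all ethernet interfaces
    fpga.write_int("sync_arm", 0)     # 2. arm the sync generator (0 -> 1 edge)
    fpga.write_int("sync_arm", 1)
    sleep(1.0 + extra_delay_s)        #    usual 1 s PPS wait plus the extra delay
    fpga.write_int("eth_reset", 1)    # 3. pulse the ethernet reset
    fpga.write_int("eth_reset", 0)
    fpga.write_int("eth_enable", 1)   # 4. enable the interfaces last
    return fpga.writes
```

Calling `start_feng(FakeFpga(), extra_delay_s=2.0)` waits three seconds between arming and resetting. On real hardware the same ordering would be issued through whatever register interface the init script already uses.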
Jack

On 27 October 2014 17:28, Richard Black <aeldstes...@gmail.com> wrote:
> Jack,
>
> I appreciate your help. I tend to agree that the issue is likely a hardware configuration problem, but we have been trying to match it as closely as possible.
>
> We do feed a 1-PPS signal into the board, but I'm hazy on the details of the other pulse parameters. I'll look into that as well.
>
> So, if I understand you correctly, you believe that the sync pulse is reaching the ethernet interfaces after the cores are enabled? If that is the case, couldn't we delay enabling the 10-GbE cores for another second to fix it? This might be a quick way to test that theory, but please correct me if I've misunderstood.
>
> Richard Black
>
> On Mon, Oct 27, 2014 at 11:05 AM, Jack Hickish <jackhick...@gmail.com> wrote:
>>
>> Hi Richard,
>>
>> I've just had a very brief look at the design / software, so take this email with a pinch of salt, but on the off-chance you haven't checked this....
>>
>> It looks like the PAPER F-engine setup on running the start script for software / firmware out of the box is:
>>
>> 1. Disable all ethernet interfaces
>> 2. Arm sync generator, wait 1 second for PPS
>> 3. Reset ethernet interfaces
>> 4. Enable interfaces.
>>
>> These four steps seem like they should be safe, yet the behaviour you're describing sounds like the design is midway through sending a packet, then gets a sync, gives up sending an end-of-frame and starts sending a new packet, at which point the old packet + the new packet = overflow.
>>
>> Knowing that the design works for PAPER, I wonder whether, after arming the sync generator, syncs are flowing through the design before the ethernet interface is enabled. Do you have a PPS-like input? The F-engine initialisation script seems to wait for a second after arming, but if your sync input is something significantly slower, you could have problems.
>>
>> I'm sceptical about this theory (I think the symptoms would be lots of OK packets when you brought up the interface, and then it dying when the sync arrives, rather than a single good packet like you're seeing), but if the firmware + software really is the same as that working with PAPER, and the wiki hasn't just got out of sync with the PAPER devs, perhaps the problem is in your hardware setup....
>>
>> Cheers,
>> Jack
>>
>> On 27 October 2014 16:38, Richard Black <aeldstes...@gmail.com> wrote:
>> > By "enable" port, I assume you mean the "valid" port. I've been looking at the PAPER model carefully for some time now, and that is how it operates. It has a gated valid signal with a software register on each 10-GbE core.
>> >
>> > Once again, this is not our model. This is one made available on the CASPER wiki and run without modifications.
>> >
>> > Richard Black
>> >
>> > On Mon, Oct 27, 2014 at 10:34 AM, Jason Manley <jman...@ska.ac.za> wrote:
>> >>
>> >> I suspect the 10GbE core's input FIFO is overflowing on startup. One key thing with this core is to ensure that your design keeps the enable port held low until the core's been configured. The core becomes unusable once the TX FIFO overflows. This has been a long-standing bug (my emails trace back to 2009) but it's so easy to work around that I don't think anyone's bothered looking into fixing it.
>> >>
>> >> Jason Manley
>> >> CBF Manager
>> >> SKA-SA
>> >>
>> >> Cell: +27 82 662 7726
>> >> Work: +27 21 506 7300
>> >>
>> >> On 27 Oct 2014, at 18:25, Richard Black <aeldstes...@gmail.com> wrote:
>> >>
>> >> > Jason,
>> >> >
>> >> > Thanks for your comments.
>> >> > While I agree that changing the ADC frequency mid-operation is non-kosher and could result in uncertain behavior, the issue at hand for us is to figure out what is going on with the PAPER model that has been published on the CASPER wiki. This naturally won't be (and shouldn't be) the end-all solution to this problem.
>> >> >
>> >> > This is a reportedly fully-functional model that shouldn't require any major changes in order to operate. However, this has clearly not been the case in at least two independent situations (us and Peter). This raises the question: what's so different about our use of PAPER?
>> >> >
>> >> > We, at BYU, have made painstakingly sure that our IP addressing schemes, switch ports, and scripts are all configured correctly (thanks to David MacMahon for that, btw), but we still have hit the proverbial brick wall of 10-GbE overflow. When I last corresponded with David, he explained that he remembers having a similar issue before, but can't recall exactly what the problem was.
>> >> >
>> >> > In any case, the fact that turning down the ADC clock prior to start-up prevents the 10-GbE core from overflowing is a major lead for us at BYU (we've been spinning our wheels on this issue for several months now). By no means are we proposing mid-run ADC clock modifications, but this appears to be a very subtle (and quite sinister, in my opinion) bug.
>> >> >
>> >> > Any thoughts as to what might be going on?
>> >> >
>> >> > Richard Black
>> >> >
>> >> > On Mon, Oct 27, 2014 at 2:41 AM, Jason Manley <jman...@ska.ac.za> wrote:
>> >> > Just a note that I don't recommend you adjust FPGA clock frequencies while it's operating.
>> >> > In theory, you should do a global reset in case the PLL/DLLs lose lock during clock transitions, in which case the logic could be in an uncertain state. But the Sysgen flow just does a single POR.
>> >> >
>> >> > A better solution might be to keep the 10GbE cores turned off (enable line pulled low) on initialisation, until things are configured (tgtap started etc), and only then enable the transmission using a SW register.
>> >> >
>> >> > Jason Manley
>> >> >
>> >> > On 25 Oct 2014, at 10:34, peter <peterniu...@163.com> wrote:
>> >> >
>> >> > > Hi Richard, Joe, & all,
>> >> > > Thanks for your help. It can finally receive packets now!
>> >> > > As you pointed out, after enabling the ADC card and running the bof file (./adc_init.rb roach1 bof file) at 200 MHz (or higher), we need to run the F-engine init script at about 75 MHz (./paper_feng_init.rb roach1:0). That allows the packets to transfer. Then we can turn the frequency up. However, the highest final ADC clock frequency I reached in my experiment is 120 MHz. Our final ADC frequency standard is 250 MHz. Maybe I need to run the bof file at a higher ADC frequency first to get a final steady 250 MHz ADC clock frequency.
>> >> > > Why does it need to be initialised at a lower frequency and then turned up? That doesn't make sense. Is the hardware going wrong? As the yellow block adc16*250-8 is designed for 250 MHz, it should be OK at 200 MHz or 250 MHz. What final frequency did you reach in your experiment?
>> >> > > Any reply will be helpful!
>> >> > > Best Regards!
>> >> > > peter
>> >> > >
>> >> > > At 2014-10-25 00:36:52, "Richard Black" <aeldstes...@gmail.com> wrote:
>> >> > > Peter,
>> >> > >
>> >> > > That's correct. We downloaded the FPGA firmware and programmed the ROACH with the precompiled bitstream. When we didn't get any data beyond that single packet, we stuck some overflow status registers in the model and found that we were overflowing at 1025 64-bit words (i.e. 8200 bytes).
>> >> > >
>> >> > > We have actually found a way to get packets to flow, but it isn't a good fix. When we turn the ADC clock frequency down to about 75 MHz, the packets begin to flow. There is an opinion in our group that the 10-GbE buffer overflow is a transient behavior, and, hence, if we slowly turn up the clock frequency after the ROACH has started up, packets may continue to flow in steady-state operation. We haven't tested this yet, though.
>> >> > >
>> >> > > Richard Black
>> >> > >
>> >> > > On Thu, Oct 23, 2014 at 8:39 PM, peter <peterniu...@163.com> wrote:
>> >> > > Hi Richard, & All,
>> >> > > As you said, the size of the isolated packet changes every time. ):
>> >> > > tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
>> >> > > listening on px1-2, link-type EN10MB (Ethernet), capture size 65535 bytes
>> >> > > 10:10:55.622053 IP 10.10.2.1.8511 > 10.10.2.9.8511: UDP, length 4616
>> >> > > Did you download the PAPER gateware from the CASPER wiki (https://casper.berkeley.edu/wiki/PAPER_Correlator_Manifest) directly? How does the PAPER bof file run on your system? Have you met the overflow before? I downloaded and installed the PAPER model as the website says, but the overflow shows up when I run paper_feng_netstat.rb.
>> >> > > Thanks for your information.
>> >> > > peter
>> >> > >
>> >> > > At 2014-10-24 09:59:12, "Richard Black" <aeldstes...@gmail.com> wrote:
>> >> > > Peter,
>> >> > >
>> >> > > I don't mean to hijack your thread, but we've been having a very similar (and time-absorbing) issue with the PAPER F-engine FPGA firmware here at BYU. Out of curiosity, does this single packet that you're receiving in tcpdump change in size every time you reprogram the ROACH? We've seen this happen, and we're pretty sure that this isolated packet is the 10-GbE buffer flushing when the 10-GbE core is initialized (i.e. the enable signal isn't sync'd with the start of a new packet).
>> >> > >
>> >> > > Regardless of whether we have the same issue, I'm very interested to see this problem's resolution.
>> >> > >
>> >> > > Good luck,
>> >> > >
>> >> > > Richard Black
>> >> > >
>> >> > > On Thu, Oct 23, 2014 at 7:50 PM, peter <peterniu...@163.com> wrote:
>> >> > > Hi Joe, & All,
>> >> > > I found something this morning: one packet is sent out from the ROACH when I run the PAPER model. I captured it with tcpdump on the HPC:
>> >> > > tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
>> >> > > listening on px1-2, link-type EN10MB (Ethernet), capture size 65535 bytes
>> >> > > 09:04:02.757813 IP 10.10.2.1.8511 > 10.10.2.9.8511: UDP, length 6456
>> >> > >
>> >> > > The length is not the expected 8200+8, and is far from the full TX buffer size of 8K+512. The other packets are stopped by the overflow.
>> >> > > I have tried changing the tutorial 2 packet size to 8200 bytes and to 8K+512 bytes.
>> >> > > It transfers fine. I also made sure the boundary size is indeed 8K+512, because when I change the size to 8K+513 bytes, no data is sent. So the packet received this morning, with length 6456, is well under the limit. But what caused the other packets to overflow?
>> >> > > Any suggestions could be helpful!
>> >> > > peter
>> >> > >
>> >> > > At 2014-10-24 00:37:14, "Kujawski, Joseph" <jkujaw...@siena.edu> wrote:
>> >> > > Peter,
>> >> > >
>> >> > > By cadence of the broadcast, I mean how often the 8200-byte packets are sent. Basically, I would like to determine how close your system is to the maximum data rate of the 10GbE.
>> >> > >
>> >> > > Also, it would be instructive to know the following:
>> >> > >
>> >> > > 1) What transmission protocol are you using? (The One_GBe module uses UDP; are you using that or TCP?)
>> >> > >
>> >> > > 2) What NICs are you using on the receive side?
>> >> > >
>> >> > > At this time, I am working on the theory that the issue is related to the network itself not being able to sustain the data volume you are generating and would like to get a better idea of how much data is generated and how often it is sent.
>> >> > >
>> >> > > Thanks,
>> >> > >
>> >> > > -Joe Kujawski
>> >> > >
>> >> > > On Thu, Oct 23, 2014 at 12:01 PM, peter <peterniu...@163.com> wrote:
>> >> > > Hi Joe,
>> >> > > 1. Yes, actually we have 3 ROACH2 boards with 8 NICs.
>> >> > > 2. Well, each ROACH has 4 of its 8 NICs connected directly to the PC; the other 4 connect to a 10 Gb switch. I connected the SFP cable (which should go to the switch) directly to the PC to see whether any data comes out, but no data comes out because of the overflow.
>> >> > > 3. Could you give an example of the broadcast cadence? I am not familiar with this.
>> >> > > It does indeed need more data, but each packet is limited to 8200 bytes.
>> >> > > Thanks for your reply!
>> >> > > peter
>> >> > > --
>> >> > > Sent from NetEase Mail for Android
>> >> > >
>> >> > > On 2014-10-23 23:16, Kujawski, Joseph wrote:
>> >> > >
>> >> > > Peter,
>> >> > >
>> >> > > I am downloading it now. Can you answer these questions:
>> >> > >
>> >> > > 1) Do you have a standard PAPER architecture with two ROACH boards each containing 8 10GbE ports?
>> >> > >
>> >> > > 2) Please describe your network architecture, i.e. how each of the ports is connected.
>> >> > >
>> >> > > 3) What is the cadence of each broadcast?
>> >> > >
>> >> > > My current suspicion is that you are generating more data than you can push through your interface(s). It may be that the higher data volume in your implementation requires more of a network infrastructure than was required by the original system.
>> >> > >
>> >> > > -Joe Kujawski
>> >> > >
>> >> > > On Thu, Oct 23, 2014 at 11:01 AM, peter <peterniu...@163.com> wrote:
>> >> > > This is a little big. roach2_tl8511port is the one that can send data normally. The environment should be OK now; last time the crc32x64_con may have been missing.
>> >> > > Good night!
>> >> > >
>> >> > > At 2014-10-23 22:52:54, "Kujawski, Joseph" <jkujaw...@siena.edu> wrote:
>> >> > > Peter,
>> >> > >
>> >> > > 1) For reference, here is a list of the errors:
>> >> > >
>> >> > > --------------------------------- Version Log ----------------------------------
>> >> > > Version                      Path
>> >> > > System Generator 14.6        C:/Xilinx/14.6/ISE_DS/ISE/sysgen
>> >> > > Matlab 8.0.0.783 (R2012b)    C:/MATLAB/R2012b
>> >> > > ISE                          C:/Xilinx/14.6/ISE_DS/ISE
>> >> > > --------------------------------------------------------------------------------
>> >> > > Summary of Errors:
>> >> > > Error 0001: Could not find the configuration m-function "crc32x64_con...
>> >> > >    Block: 'roach2_fengine_tl8511port/transpose/Transpose1/crc/crc32x64'
>> >> > > Error 0002: Could not find the configuration m-function "crc32x64_con...
>> >> > >    Block: 'roach2_fengine_tl8511port/transpose/Transpose2/crc/crc32x64'
>> >> > > Error 0003: Could not find the configuration m-function "crc32x64_con...
>> >> > >    Block: 'roach2_fengine_tl8511port/transpose/Transpose3/crc/crc32x64'
>> >> > > Error 0004: Could not find the configuration m-function "crc32x64_con...
>> >> > >    Block: 'roach2_fengine_tl8511port/transpose/Transpose4/crc/crc32x64'
>> >> > > --------------------------------------------------------------------------------
>> >> > >
>> >> > > 2) Your email did not have an attachment. I have more comments, but wanted to let you know about the attachment before you went to bed.
>> >> > >
>> >> > > -Joe Kujawski
>> >> > >
>> >> > > On Thu, Oct 23, 2014 at 10:33 AM, peter <peterniu...@163.com> wrote:
>> >> > >
>> >> > > Hi Joe,
>> >> > > Thanks for your warm help!
>> >> > > What errors show up when you compile my model? Is some file missing? I will pack up all my files and send them to you as an attachment. And how about the PAPER one? Did it report an overflow message? You need to install and use Ruby to control it.
>> >> > > Leaving the PAPER model aside, let's talk about the 10Gb block on ROACH v2. Your model is good for seeing the Data_valid, EOF, etc., but I don't know how to add it to the PAPER model, since I realise the PAPER model generates data valid and EOF from a counter. So I don't know where to put the model. For example, if I put the data_valid or EOF control process you designed on the 10GbE port in the PAPER model, then I think it is equivalent to using a 10GbE block instead of the One_GBe block in yours. *_*!!
>> >> > > I changed the number 50 to 1025 in tutorial 2 to make the packet size 8200 bytes, and it seems to transfer well without errors. The period is 1.3*1025, meaning one packet is sent every 1.3*1025 clocks. I found this boundary value of 1.3*1025 through a lot of testing. When I set the period lower than 1.3*1025, the first few packets are sent out, but then the overflow comes. I think it is the transfer rate that determines the overflow.
>> >> > > Thanks for your suggestions and advice!
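As a rough check on the 1.3*1025 figure above, the sketch below computes the payload rate implied by sending one 8200-byte packet every 1.3*1025 clocks. The clock rates are assumptions chosen for illustration (the thread does not state the tutorial 2 fabric clock), and `payload_rate_bps` is a hypothetical helper name.

```python
WORD_BITS = 64            # the thread describes 64-bit words
WORDS_PER_PACKET = 1025   # 1025 words * 8 bytes = 8200 bytes


def payload_rate_bps(clock_hz, clocks_per_packet):
    """Average payload bit rate when one packet is sent every clocks_per_packet cycles."""
    bits_per_packet = WORD_BITS * WORDS_PER_PACKET
    packets_per_second = clock_hz / clocks_per_packet
    return bits_per_packet * packets_per_second


if __name__ == "__main__":
    period = 1.3 * WORDS_PER_PACKET     # boundary period reported in the thread
    for clock_mhz in (100, 150, 200):   # assumed clock rates, not from the thread
        rate = payload_rate_bps(clock_mhz * 1e6, period)
        print(f"{clock_mhz} MHz -> {rate / 1e9:.2f} Gb/s payload")
```

Under the assumed 200 MHz clock this works out to roughly 9.85 Gb/s of payload, which would sit close to the 10GbE line rate. At lower clocks the same period gives proportionally less, so the clock actually in use matters when interpreting the boundary.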
>> >> > > peter
>> >> > >
>> >> > > At 2014-10-23 00:14:29, "Kujawski, Joseph" <jkujaw...@siena.edu> wrote:
>> >> > >
>> >> > > Peter,
>> >> > >
>> >> > > I find that I cannot compile and simulate your design; however, looking at the code structure, I can't tell if tx_val and tx_EOF are high at the same time:
>> >> > >
>> >> > > Also, I modified the design to send out a packet of size 8200 once per second (model attached) and added a register that latches the GBE tx_aful and tx_overrun lines so they can be read through the KATCP interface. Modify the model to remove the oscilloscope and Xilinx out gateways before compiling it for your platform. Note that this model does not check for overflow, though the latch will let you know if you have had one.
>> >> > >
>> >> > > Let me know how this works for you.
>> >> > >
>> >> > > -Joe Kujawski
>> >> > > --
>> >> > > **************************************
>> >> > > * Joe Kujawski
>> >> > > * Siena College
>> >> > > * Dept. of Physics and Astronomy, RB 113
>> >> > > * 515 Loudon Road
>> >> > > * Loudonville, NY 12211-1462
>> >> > > *
>> >> > > * Email: jkujaw...@siena.edu
>> >> > > * Phone: 518-867-7509 <-- NEW NUMBER
>> >> > > * Fax: 518-783-2986
>> >> > > **************************************
>> >> > >
>> >> > > Cloud attachment sent from NetEase 163 Mail
>> >> > >
>> >> > > paperfengine.zip (126.71M, expires 7 November 2014 22:58)