Re: [casper] Dropped packets during HASHPIPE data acquisition
Also, I tried to condense/summarize the issue so if you would like additional details, please feel free to ask and I'll provide them.

Thanks again,

Mark Ruzindana

On Tue, Dec 15, 2020 at 10:00 PM Mark Ruzindana wrote:

> Hi all,
>
> While running hashpipe with the intention of debugging using gdb as
> suggested, I failed to replicate my segfault issue. On one hand, it should
> have been working given what I understand about the packet socket
> implementation and the way that I wrote the code, but on the other, I don't
> know why it works now, and not before because I didn't make any changes
> between runs. It's a stretch, but there were a few reboots and improvements
> in cable organization within the rack, but that's about it.
>
> I'm taking note of the following change for documentation purposes. It's
> not the reason for my issue. Feel free to ignore or comment on it. This
> change was made before and remained after I observed the segfault issue. To
> flush the packets in the port before the thread is run, I am using
> "p_frame=hashpipe_pktsock_recv_udp_frame_nonblock(p_ps, bindport)" instead
> of "p_frame=hashpipe_pktsock_recv_frame_nonblock(p_ps, bindport)" in the
> while loop, otherwise, there's an infinite loop because there are packets
> with other protocols constantly being captured by the port.
>
> I'm hoping I figure out what change was made as I am debugging the rest of
> this, but for now the specific segfault that I was having is no longer an
> issue. It's unsatisfying and I'll come back to it if I don't figure it out
> as I go, but for now, I'm moving on.
>
> Okay, so now, I'm still experiencing dropped packets. Given a kernel page
> size of 4096 bytes and a frame size of 16384 bytes, I have tried buffer
> parameters ranging from, 480 to 128000 total number of frames and 60 to
> 1000 blocks respectively. With improvements in throughput in one instance,
> but not the other three that I have running. The one instance with
> improvements, on the upper end of that range, exceeds the number of packets
> expected in a hashpipe shared memory buffer block (the ring buffers in
> between threads), but only for about four or so of them at the very
> beginning of a scan. No dropped packets for the rest of the scan. While the
> other instances, with no recognizable improvements, drop packets through
> out the scan with one of them dropping significantly more than the other
> two.
>
> I'm currently trying a few things to debug this, but I figured that I
> would ask sooner rather than later. Is there a configuration or step that I
> may have missed in the implementation of packet sockets? My understanding
> is that it should handle my current data rates with no problem. So with
> multiple instances running (four in my case), I should be able to capture
> data with 0 dropped packets (100% data throughput).
>
> Just a note, with a packet size of 8168 bytes, and a frame size of 8192
> bytes, hashpipe was crashing, but in a completely unrelated way to how it
> did before. It was *not* a segfault after capturing the exact number of
> packets that correspond to the number of frames in the packet socket ring
> buffer as I described in previous emails. The crashes were more
> inconsistent and I think it's because the frame size needs to be
> considerably larger than the packet size. An order of 2 seemed to be
> enough. I currently have the frame size set to 16384 (also a multiple of
> the kernel page size), and do not have an issue with hashpipe crashing.
>
> Let me know if you have any thoughts and suggestions. I really appreciate
> the help.
>
> Thanks,
>
> Mark Ruzindana
>
> On Thu, Dec 3, 2020 at 11:16 AM Mark Ruzindana wrote:
>
>> Thanks for the suggestion David!
>>
>> I was starting hashpipe in the debugger. I'll use gdb and the core file,
>> and let you know what I find. If I still can't figure out the problem, I
>> will send you a minimum non-working example. I definitely think it's some
>> sort of pointer arithmetic error as well, I just can't see it yet. I really
>> appreciate the help.
>>
>> Thanks again,
>>
>> Mark
>>
>> On Thu, Dec 3, 2020 at 1:30 AM David MacMahon wrote:
>>
>>> Hi, Mark,
>>>
>>> Sorry to hear you're still getting a segfault. It sounds like you made
>>> some progress with gdb, but the fact that you ended up with a different
>>> sort of error suggests that you were starting hashpipe in the debugger. To
>>> debug your initial segfault problem, you can run hashpipe without the
>>> debugger, let it segfault and generate a core file, then use gdb and the
>>> core file (and hashpipe) to examine the state of the program when the
>>> segfault occurred. The tricky part is getting the core file to be
>>> generated on a segfault. You typically have to increase the core file size
>>> limit using "ulimit -c unlimited" and (because hashpipe is typically
>>> installed with the suid bit set) you have to let the kernel know it's OK to
>>> dump core files for suid programs using "sudo
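
As a rough cross-check of the buffer parameters quoted in the message above, here is a minimal C sketch of the ring arithmetic. It assumes that "frames" and "blocks" in the message correspond to the frame count and block count of a Linux packet-socket receive ring, where the block size is expected to be a multiple of the page size and the frames must divide evenly among the blocks. The struct and function names are hypothetical helpers, not part of hashpipe.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type (not a hashpipe type): one candidate ring layout. */
struct ring_geometry {
    size_t page_size;   /* kernel page size in bytes (4096 in the thread) */
    size_t frame_size;  /* bytes per ring frame (16384 in the thread) */
    size_t frame_nr;    /* total number of frames in the ring */
    size_t block_nr;    /* number of blocks in the ring */
};

/* Frames per block, or 0 if the frames do not divide evenly into blocks. */
size_t frames_per_block(const struct ring_geometry *g)
{
    assert(g->block_nr > 0);
    return (g->frame_nr % g->block_nr == 0) ? g->frame_nr / g->block_nr : 0;
}

/* Bytes per block implied by the layout (0 if the layout is uneven). */
size_t block_size(const struct ring_geometry *g)
{
    return frames_per_block(g) * g->frame_size;
}

/* Total bytes of memory the whole ring would occupy. */
size_t ring_bytes(const struct ring_geometry *g)
{
    return g->frame_nr * g->frame_size;
}

/* 1 if the block size is a non-zero multiple of the page size, else 0. */
int geometry_ok(const struct ring_geometry *g)
{
    size_t bs = block_size(g);
    return bs != 0 && bs % g->page_size == 0;
}

/* Bytes left in a frame once the packet itself is stored, or -1 if the
 * packet alone does not fit. The per-frame capture header that the kernel
 * writes ahead of the packet data must also fit in this headroom. */
long frame_headroom(size_t frame_size, size_t packet_size)
{
    if (packet_size > frame_size)
        return -1;
    return (long)(frame_size - packet_size);
}
```

With the numbers from the message, 480 frames in 60 blocks gives 8 frames (131072 bytes) per block, and 128000 frames in 1000 blocks gives 128 frames (2097152 bytes) per block, for a ring of 2097152000 bytes per instance. An 8168-byte packet in an 8192-byte frame leaves only 24 bytes of headroom, which is probably too little for the per-frame header. The exact overhead depends on the kernel's packet-socket header version, and on whether the 8168 bytes already include the Ethernet, IP, and UDP headers.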
Re: [casper] Dropped packets during HASHPIPE data acquisition
Hi all,

While running hashpipe with the intention of debugging using gdb as suggested, I failed to replicate my segfault issue. On one hand, it should have been working given what I understand about the packet socket implementation and the way that I wrote the code, but on the other, I don't know why it works now and not before, because I didn't make any changes between runs. It's a stretch, but there were a few reboots and improvements in cable organization within the rack, and that's about it.

I'm taking note of the following change for documentation purposes. It's not the reason for my issue. Feel free to ignore or comment on it. This change was made before and remained after I observed the segfault issue. To flush the packets in the port before the thread is run, I am using "p_frame=hashpipe_pktsock_recv_udp_frame_nonblock(p_ps, bindport)" instead of "p_frame=hashpipe_pktsock_recv_frame_nonblock(p_ps, bindport)" in the while loop. Otherwise, there's an infinite loop because there are packets with other protocols constantly being captured by the port.

I'm hoping I figure out what change was made as I am debugging the rest of this, but for now the specific segfault that I was having is no longer an issue. It's unsatisfying and I'll come back to it if I don't figure it out as I go, but for now, I'm moving on.

Okay, so now, I'm still experiencing dropped packets. Given a kernel page size of 4096 bytes and a frame size of 16384 bytes, I have tried buffer parameters ranging from 480 to 128000 total frames and from 60 to 1000 blocks, respectively. This improved throughput in one instance, but not in the other three that I have running. The one instance with improvements, on the upper end of that range, exceeds the number of packets expected in a hashpipe shared memory buffer block (the ring buffers in between threads), but only for about four or so of them at the very beginning of a scan. It drops no packets for the rest of the scan. The other instances, with no recognizable improvements, drop packets throughout the scan, with one of them dropping significantly more than the other two.

I'm currently trying a few things to debug this, but I figured that I would ask sooner rather than later. Is there a configuration or step that I may have missed in the implementation of packet sockets? My understanding is that it should handle my current data rates with no problem. So with multiple instances running (four in my case), I should be able to capture data with 0 dropped packets (100% data throughput).

Just a note: with a packet size of 8168 bytes and a frame size of 8192 bytes, hashpipe was crashing, but in a completely unrelated way to how it did before. It was *not* a segfault after capturing the exact number of packets that correspond to the number of frames in the packet socket ring buffer, as I described in previous emails. The crashes were more inconsistent, and I think it's because the frame size needs to be considerably larger than the packet size. A factor of 2 seemed to be enough. I currently have the frame size set to 16384 (also a multiple of the kernel page size), and do not have an issue with hashpipe crashing.

Let me know if you have any thoughts and suggestions. I really appreciate the help.

Thanks,

Mark Ruzindana

On Thu, Dec 3, 2020 at 11:16 AM Mark Ruzindana wrote:

> Thanks for the suggestion David!
>
> I was starting hashpipe in the debugger. I'll use gdb and the core file,
> and let you know what I find. If I still can't figure out the problem, I
> will send you a minimum non-working example. I definitely think it's some
> sort of pointer arithmetic error as well, I just can't see it yet. I really
> appreciate the help.
>
> Thanks again,
>
> Mark
>
> On Thu, Dec 3, 2020 at 1:30 AM David MacMahon wrote:
>
>> Hi, Mark,
>>
>> Sorry to hear you're still getting a segfault. It sounds like you made
>> some progress with gdb, but the fact that you ended up with a different
>> sort of error suggests that you were starting hashpipe in the debugger. To
>> debug your initial segfault problem, you can run hashpipe without the
>> debugger, let it segfault and generate a core file, then use gdb and the
>> core file (and hashpipe) to examine the state of the program when the
>> segfault occurred. The tricky part is getting the core file to be
>> generated on a segfault. You typically have to increase the core file size
>> limit using "ulimit -c unlimited" and (because hashpipe is typically
>> installed with the suid bit set) you have to let the kernel know it's OK to
>> dump core files for suid programs using "sudo sysctl -w fs.suid_dumpable=1"
>> (or maybe 2 if 1 doesn't quite do it). You can read more about these steps
>> with "help ulimit" (ulimit is a bash builtin) and "man 5 proc".
>>
>> Once you have the core file (typically named "core" but it may have a
>> numeric extension from the PID of the crashing process) you can debug
>> things with "gdb /path/to/hashpipe
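
The core-file size limit that David's message raises with "ulimit -c unlimited" can also be raised from inside a process. The following is a minimal sketch, assuming a POSIX system; `raise_core_limit` is a hypothetical helper name, not a hashpipe function. It lifts the soft limit only as far as the existing hard limit, and it does not replace the "fs.suid_dumpable" sysctl step for suid programs.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Raise the soft core-file size limit to the current hard limit.
 * When the hard limit is unlimited, this matches "ulimit -c unlimited".
 * Returns 0 on success and -1 on failure. Hypothetical helper. */
int raise_core_limit(void)
{
    struct rlimit rl;

    /* Read the current soft and hard limits for core files. */
    if (getrlimit(RLIMIT_CORE, &rl) != 0) {
        perror("getrlimit");
        return -1;
    }

    /* An unprivileged process may raise its soft limit up to the hard limit. */
    rl.rlim_cur = rl.rlim_max;
    if (setrlimit(RLIMIT_CORE, &rl) != 0) {
        perror("setrlimit");
        return -1;
    }
    return 0;
}
```

A program would call `raise_core_limit()` once near the start of `main`, before any code that might crash. Whether a core file is actually written still depends on system settings such as the core pattern and the suid dump policy mentioned in the thread.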