Re: [casper] Dropped packets during HASHPIPE data acquisition

2020-12-15 Thread Mark Ruzindana
Also, I tried to condense/summarize the issue so if you would like
additional details, please feel free to ask and I'll provide them.

Thanks again,

Mark Ruzindana

Re: [casper] Dropped packets during HASHPIPE data acquisition

2020-12-15 Thread Mark Ruzindana
Hi all,

While running hashpipe with the intention of debugging it in gdb as
suggested, I failed to replicate my segfault issue. On one hand, it should
work given what I understand about the packet socket implementation and
the way that I wrote the code; on the other, I don't know why it works now
and not before, because I didn't make any changes between runs. It's a
stretch, but there were a few reboots and improvements in cable
organization within the rack in the meantime, and that's about it.

I'm taking note of the following change for documentation purposes. It's
not the reason for my issue, so feel free to ignore or comment on it; the
change was made before and remained after I observed the segfault issue.
To flush stale packets from the socket before the thread runs, I am using
"p_frame = hashpipe_pktsock_recv_udp_frame_nonblock(p_ps, bindport)"
instead of "p_frame = hashpipe_pktsock_recv_frame_nonblock(p_ps,
bindport)" in the while loop; otherwise there's an infinite loop, because
packets of other protocols are constantly being captured on the port.
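
For anyone following along, here is a minimal sketch of that flush loop,
using the hashpipe_pktsock API as I understand it (p_ps and bindport are
the net thread's packet socket state and destination UDP port):

    /* Drain whatever is already sitting in the ring before the main
     * receive loop starts.  The _udp_ variant returns NULL once no
     * frame holding a UDP packet for bindport remains, so this loop
     * can terminate even while non-UDP traffic keeps arriving, unlike
     * the plain recv_frame variant. */
    unsigned char *p_frame;
    while ((p_frame = hashpipe_pktsock_recv_udp_frame_nonblock(p_ps, bindport))) {
        /* Hand the frame straight back to the kernel unprocessed. */
        hashpipe_pktsock_release_frame(p_frame);
    }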

I'm hoping to figure out what changed as I debug the rest of this, but
for now the specific segfault that I was having is no longer an issue.
It's unsatisfying, and I'll come back to it if I don't figure it out as I
go, but for now I'm moving on.

Okay, so now I'm still experiencing dropped packets. Given a kernel page
size of 4096 bytes and a frame size of 16384 bytes, I have tried buffer
parameters ranging from 480 to 128000 total frames and, correspondingly,
60 to 1000 blocks. This improved throughput in one instance, but not in
the other three that I have running. The one instance that improved, at
the upper end of that range, exceeds the expected number of packets per
hashpipe shared memory buffer block (the ring buffers between threads),
but only for about the first four blocks of a scan; it drops no packets
for the rest of the scan. The other instances, with no recognizable
improvement, drop packets throughout the scan, and one of them drops
significantly more than the other two.
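
For reference, this is roughly how those parameters enter the ring setup,
sketched from my reading of hashpipe_pktsock.h (the interface name "eth4"
and the thread name string are just placeholders, and the error path is
abbreviated):

    /* Ring geometry must be filled in before hashpipe_pktsock_open()
     * is called.  frame_size should be a multiple of the 4096-byte
     * page size, and nframes should be a multiple of nblocks so each
     * block holds a whole number of frames. */
    struct hashpipe_pktsock p_ps;
    p_ps.frame_size = 16384;   /* 4 pages per frame         */
    p_ps.nframes    = 128000;  /* total frames in the ring  */
    p_ps.nblocks    = 1000;    /* i.e. 128 frames per block */
    if (hashpipe_pktsock_open(&p_ps, "eth4", PACKET_RX_RING) != 0) {
        hashpipe_error("net_thread", "Error opening packet socket.");
    }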

I'm currently trying a few things to debug this, but I figured that I would
ask sooner rather than later. Is there a configuration or step that I may
have missed in the implementation of packet sockets? My understanding is
that it should handle my current data rates with no problem. So with
multiple instances running (four in my case), I should be able to capture
data with 0 dropped packets (100% data throughput).

Just a note: with a packet size of 8168 bytes and a frame size of 8192
bytes, hashpipe was crashing, but in a way completely unrelated to how it
did before. It was *not* a segfault after capturing exactly as many
packets as there are frames in the packet socket ring buffer, as I
described in previous emails. The crashes were more inconsistent, and I
think it's because the frame size needs to be considerably larger than
the packet size; a factor of 2 seemed to be enough. I currently have the
frame size set to 16384 bytes (also a multiple of the kernel page size)
and no longer have an issue with hashpipe crashing.
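
My working theory on the arithmetic, which is an assumption on my part
(PACKET_RX_RING with the default TPACKET_V1 header) rather than something
I've confirmed in the kernel source: each ring frame has to hold the
kernel's tpacket header plus the entire raw packet, Ethernet/IP/UDP
headers included, so an 8192-byte frame leaves far too little headroom
for an 8168-byte payload:

    #include <stdio.h>
    #include <linux/if_packet.h>  /* struct tpacket_hdr, TPACKET_HDRLEN */

    int main(void) {
        /* Per-frame overhead: the aligned tpacket header plus
         * sockaddr_ll (TPACKET_HDRLEN), then Ethernet (14), IP (20)
         * and UDP (8) headers ahead of the payload. */
        unsigned int overhead = TPACKET_HDRLEN + 14 + 20 + 8;
        printf("min frame size for an 8168-byte payload: %u bytes\n",
               overhead + 8168);  /* comfortably over 8192 */
        return 0;
    }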

Let me know if you have any thoughts and suggestions. I really appreciate
the help.

Thanks,

Mark Ruzindana

On Thu, Dec 3, 2020 at 11:16 AM Mark Ruzindana  wrote:

> Thanks for the suggestion David!
>
> I was starting hashpipe in the debugger. I'll use gdb and the core file,
> and let you know what I find. If I still can't figure out the problem, I
> will send you a minimum non-working example. I definitely think it's some
> sort of pointer arithmetic error as well, I just can't see it yet. I really
> appreciate the help.
>
> Thanks again,
>
> Mark
>
> On Thu, Dec 3, 2020 at 1:30 AM David MacMahon  wrote:
>
>> Hi, Mark,
>>
>> Sorry to hear you're still getting a segfault.  It sounds like you made
>> some progress with gdb, but the fact that you ended up with a different
>> sort of error suggests that you were starting hashpipe in the debugger.  To
>> debug your initial segfault problem, you can run hashpipe without the
>> debugger, let it segfault and generate a core file, then use gdb and the
>> core file (and hashpipe) to examine the state of the program when the
>> segfault occurred.  The tricky part is getting the core file to be
>> generated on a segfault.  You typically have to increase the core file size
>> limit using "ulimit -c unlimited" and (because hashpipe is typically
>> installed with the suid bit set) you have to let the kernel know it's OK to
>> dump core files for suid programs using "sudo sysctl -w fs.suid_dumpable=1"
>> (or maybe 2 if 1 doesn't quite do it).  You can read more about these steps
>> with "help ulimit" (ulimit is a bash builtin) and "man 5 proc".
>>
>> Once you have the core file (typically named "core" but it may have a
>> numeric extension from the PID of the crashing process) you can debug
>> things with "gdb /path/to/hashpipe /path/to/core".