Re: reading from socket

Brandon McCaig Wed, 12 Aug 2015 10:53:56 -0700

Chris:

On Wed, Aug 12, 2015 at 12:16:12AM +0200, Chris Knipe wrote:
> > Firstly, if the handle isn't being read with binmode set then
> > perhaps the \r\n are being converted to \n (if this is
> > Windows)? How are you creating/initializing the socket?
> >
> 
> Unfortunately, with or without binmode, there's no difference
> to the matching (from what I can tell)


The only practical difference should be what is considered a
newline. It would only affect you if you were on Windows (though
it's best practice regardless to make your code portable). If the
data is truly binary you should do it regardless. If it's text
then you should instead set up a proper encoding layer. To do that
you need to know the exact encoding being used (check the
standards, documentation, or ask the source).
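
As a rough sketch of the difference (the sample bytes are made up,
and an in-memory handle stands in for your socket, so treat this
as an illustration rather than a drop-in fix):

```perl
use strict;
use warnings;

# Hypothetical sample: "café" encoded as UTF-8, followed by CRLF.
my $bytes = "caf\xC3\xA9\r\n";

# Binary: binmode() with no layer hands back the octets untouched.
open(my $raw, '<', \$bytes) or die "open: $!";
binmode($raw);
my $octets = do { local $/; <$raw> };
close($raw);

# Text: an encoding layer decodes the octets into characters.
open(my $txt, '<', \$bytes) or die "open: $!";
binmode($txt, ':encoding(UTF-8)');
my $chars = do { local $/; <$txt> };
close($txt);

printf "raw: %d octets, decoded: %d characters\n",
    length($octets), length($chars);
```

The two counts differ because the two-octet sequence for the
accented letter becomes a single character once it is decoded.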

> Socket creation:
> my $TCPSocket = new IO::Socket::INET (PeerHost => "x.x.x.x",
>                                       PeerPort => "5000",
>                                       Proto    => "tcp",
>                                       Blocking => "1",  # <-- Tried with blocking (0|1) as well.
>                                      ) or die "ERROR in Socket Creation : $!\n";

You probably want blocking (1) since you want your read() to
block and wait until there's data. Non-blocking would require a
bit smarter program layout to handle it.
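
To give a feel for what non-blocking means in practice, here is a
small sketch. It assumes a Unix-like system and uses a local
socketpair() in place of your real TCP connection; the "200 hello"
line is invented for the demonstration.

```perl
use strict;
use warnings;
use IO::Handle;
use Socket qw(AF_UNIX SOCK_STREAM PF_UNSPEC);
use Errno qw(EAGAIN EWOULDBLOCK);

# A local socket pair stands in for the real TCP connection.
socketpair(my $client, my $server, AF_UNIX, SOCK_STREAM, PF_UNSPEC)
    or die "socketpair: $!";

$client->blocking(0);

# Nothing has been sent yet: a non-blocking read fails immediately
# instead of waiting for data.
my $n = sysread($client, my $buf, 512);
my $would_block = !defined($n) && ($! == EAGAIN || $! == EWOULDBLOCK);

# Once the peer writes something, the same call succeeds.
syswrite($server, "200 hello\r\n") or die "syswrite: $!";
my $got = sysread($client, $buf, 512);

print "would block: ", ($would_block ? "yes" : "no"), "\n";
print "then read $got bytes\n";
```

With blocking left on, the first sysread() would simply wait, which
is the behaviour you want for a simple request/response client.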

> From the relevant RFCs:
> 
>    The terms "NUL", "TAB", "LF", "CR, and "space" refer to the
>    octets %x00, %x09, %x0A, %x0D, and %x20, respectively (that
>    is, the octets with those codes in US-ASCII [ANSI1986] and
>    thus in UTF-8 [RFC3629]). The term "CRLF" or "CRLF pair"
>    means the sequence CR immediately followed by LF (that is,
>    %x0D.0A).  A "printable US-ASCII character" is an octet in
>    the range %x21-7E.  Quoted characters refer to the octets
>    with those codes in US-ASCII (so "." and "<" refer to %x2E
>    and %x3C) and will always be printable US-ASCII characters;
>    similarly, "digit" refers to the octets %x30-39.

Which RFC(s)?

> However, the data stream does contain yEnc content, which as
> far as I know, is 8-bit encoding.  So whilst the protocol
> itself may use UTF-8, the data transmitted in the protocol can
> either be UTF-8, or 8-bit

To clarify, UTF-8 is an 8-bit encoding too. However, it's a
variable-width encoding, meaning that the number of bytes to a
character is variable.

It sounds to me like this is an ASCII or Latin1 stream of text
which potentially contains binary data (whether that is otherwise
encoded text or image data or anything else) which is embedded
into the ASCII stream as yEnc encoded data (which is ASCII
compatible).

I'm assuming that the protocol itself is not yEncoded, and only
parts of the data clearly delimited by the protocol will be
yEncoded. That shouldn't be affecting your ability to find the
termination characters.
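
Something along these lines is what I have in mind. It is only a
sketch: find_end_of_response() is a name I made up, the sample
response is invented, and I'm assuming the terminator is the
CRLF.CRLF sequence you described.

```perl
use strict;
use warnings;

# Returns the offset just past the terminator, or -1 if the
# terminator has not arrived yet. Hypothetical helper, not part of
# any module.
sub find_end_of_response {
    my ($buffer) = @_;
    my $terminator = "\r\n.\r\n";
    my $pos = index($buffer, $terminator);
    return $pos < 0 ? -1 : $pos + length($terminator);
}

# Made-up response: a status line, a line of high-bit octets standing
# in for yEnc data, and the terminator.
my $complete = "222 0 <id> body\r\n\xE1\xFE\x8A\x3D\x4A\r\n.\r\n";
my $partial  = substr($complete, 0, length($complete) - 3);

print find_end_of_response($complete), "\n";
print find_end_of_response($partial), "\n";
```

index() compares octets as-is, so the high-bit bytes in the middle
don't get in the way as long as the handle is read in binary mode.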

> > You can attach an IO layer to the file handle by passing an
> > additional argument to binmode:
> >
> >     binmode $fh, ':encoding(UTF-8)';
> >
> >
> Loads, and LOADS and *piles* of UTF-8 errors...
> 
> utf8 "\xD826" does not map to Unicode at test.pl line 40 (#1)
> utf8 "\x1583F9" does not map to Unicode at test.pl line 40 (#1)
> etc.

That was just an example of how to specify an encoding layer for
the handle. It only makes sense to specify UTF-8 if you know the
stream is encoded using UTF-8... You can't really process it
unless you know the encoding. It sounds to me like it should be
US-ASCII, but I don't know. Do you? Can you tell us specifically
which protocol this is or direct us to any documentation or RFCs?

> From personal experience and using other (nasty) methods and
> components for doing what I -should- be able to do with native
> perl, I've learned the hard way that messing with binmode $fh,
> ":encoding...." generally corrupts the 8-bit (yEnc) data.
> Again, I am more than likely doing it incorrectly, but I'm
> really trying to understand how to do it correctly though :-)

"binmode($fh)" should never corrupt the data. It will preserve
the data byte-for-byte. "binmode($fh, ':encoding(...)')" will only
corrupt the data if you specify the wrong encoding. If the data
is supposed to be UTF-8, but you want to tolerate errors, you
could use :utf8 instead (it skips the strict validation, so use
it with care). It should not affect yEnc at all.
However, that's assuming the stream is in fact UTF-8 encoded. I
suspect that it's not. If it's pure US-ASCII (nothing above 127)
then decoding as UTF-8 would work anyway, but it's wrong. If it
happens to be latin1 or another ASCII extension then decoding as
UTF-8 could indeed corrupt it.
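
Here is a tiny illustration of that last point using the Encode
module directly rather than a handle layer. The sample strings are
made up, and I'm relying on decode()'s default behaviour of
substituting U+FFFD for malformed input.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $ascii  = "200 ok";     # nothing above 127
my $latin1 = "caf\xE9";    # "café" as Latin-1 octets (made-up sample)

# Pure ASCII decodes to the same string either way.
my $ascii_as_utf8 = decode('UTF-8', $ascii);

# The lone 0xE9 octet is not valid UTF-8, so by default decode()
# substitutes U+FFFD for it and the original octet is lost.
my $wrong = decode('UTF-8', $latin1);
my $right = decode('ISO-8859-1', $latin1);

printf "as UTF-8:   U+%04X\n", ord(substr($wrong, -1));
printf "as Latin-1: U+%04X\n", ord(substr($right, -1));
```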

> But isn't that exactly why we set things like autoflush(1) or
> $|=1?  After the data stream has been sent from the server
> (i.e. CRLF.CRLF) the server stops transmitting data and waits
> for the next command, so there's no chance that a second data
> stream may be received by the client socket, at least not until
> the client socket issues a new command.

Keep in mind that when reading from a socket you're not reading
directly from a hard link to the data. There's an entire network
of devices that the data has to travel through to arrive at your
machine. Lots can happen on the network. Packets can be dropped,
routed wrong, etc. The packets get reassembled in order once they
arrive on your machine and only then can you read the data from
them. There's no way to ensure that all of the data is there when
you go to read. All you can do is read and hope. There's no
guarantee the data will ever get to you. A robust program will be
prepared to deal with that.

One thing to note is that when you read you're throwing away
partial reads of data. Each read() overwrites the buffer, so if
the stream contains more than 512 bytes of data then you'll lose
the first n - 512 bytes. If you want to slurp up the entire thing
into memory and process it all at once then perhaps add an OFFSET
argument to read() equal to the current length of the buffer,
indicating that it should store the read data at the end of the
buffer instead of overwriting it.
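
Roughly like this. It's a sketch only: an in-memory handle stands
in for your socket, the payload is made up, and the chunk size is
deliberately tiny to force several partial reads.

```perl
use strict;
use warnings;

# An in-memory handle stands in for the socket; the payload is made up.
my $payload = "line one\r\nline two\r\n.\r\n";
open(my $fh, '<', \$payload) or die "open: $!";
binmode($fh);

my $buffer = '';
while (1) {
    # OFFSET = length($buffer) appends to the buffer instead of
    # overwriting what earlier reads stored there.
    my $n = read($fh, $buffer, 8, length($buffer));
    die "read failed: $!" unless defined $n;
    last if $n == 0;                             # EOF
    last if index($buffer, "\r\n.\r\n") >= 0;    # terminator seen
}
close($fh);

print length($buffer), " bytes buffered\n";
```

Checking the return value matters too: read() gives you 0 at end
of file and undef on error, and a robust loop handles both.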

> I realize my problem here is the really whacky way in which the
> data stream is encoded (and that is completely out of my
> control).  But there must be a adequate and proper way to
> handle this data.

I don't think the data stream is encoded overly whacky. I think
the problem is just that writing computer software is *hard*. It
takes a bit of experience with something new to wrap your head
around it, and until then even things that you know how to do can
go wrong. It would be helpful if you could show us a complete
program that we could test for ourselves. If you can't share the
whole thing then try creating a minimal example. We will figure
this out...!

Regards,


-- 
Brandon McCaig <bamcc...@gmail.com> <bamcc...@castopulence.org>
Castopulence Software <https://www.castopulence.org/>
Blog <http://www.bambams.ca/>
perl -E '$_=q{V zrna gur orfg jvgu jung V fnl. }.
q{Vg qbrfa'\''g nyjnlf fbhaq gung jnl.};
tr/A-Ma-mN-Zn-z/N-Zn-zA-Ma-m/;say'
