Chris:

On Wed, Aug 12, 2015 at 12:16:12AM +0200, Chris Knipe wrote:
> > Firstly, if the handle isn't being read with binmode set then
> > perhaps the \r\n are being converted to \n (if this is
> > Windows)? How are you creating/initializing the socket?
>
> Unfortunately, with or without binmode, there's no difference
> to the matching (from what I can tell)

The only practical difference should be what is considered a
newline. It would only affect you if you were on Windows (though
it's best practice regardless to make your code portable). If
the data is truly binary you should set binmode regardless. If
it's text then you should instead set up a proper encoding
layer. To do that you need to know the exact encoding being used
(check the standards, documentation, or ask the source).

> Socket creation:
>     my $TCPSocket = new IO::Socket::INET (PeerHost => "x.x.x.x",
>                                           PeerPort => "5000",
>                                           Proto    => "tcp",
>                                           Blocking => "1",  #### <-- Tried with blocking (0|1) as well.
>                     ) or die "ERROR in Socket Creation : $!\n";

You probably want blocking (1) since you want your read() to
block and wait until there's data. Non-blocking would require a
somewhat smarter program layout to handle it.

> From the relevant RFCs:
>
>    The terms "NUL", "TAB", "LF", "CR, and "space" refer to the
>    octets %x00, %x09, %x0A, %x0D, and %x20, respectively (that
>    is, the octets with those codes in US-ASCII [ANSI1986] and
>    thus in UTF-8 [RFC3629]).  The term "CRLF" or "CRLF pair"
>    means the sequence CR immediately followed by LF (that is,
>    %x0D.0A).  A "printable US-ASCII character" is an octet in
>    the range %x21-7E.  Quoted characters refer to the octets
>    with those codes in US-ASCII (so "." and "<" refer to %x2E
>    and %x3C) and will always be printable US-ASCII characters;
>    similarly, "digit" refers to the octets %x30-39.

Which RFC(s)?

> However, the data stream does contain yEnc content, which as
> far as I know, is 8-bit encoding.  So whilst the protocol
> itself may use UTF-8, the data transmitted in the protocol can
> either be UTF-8, or 8-bit

To clarify, UTF-8 is an 8-bit encoding too. However, it's a
variable-width encoding: a single character takes anywhere from
one to four bytes.
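
To illustrate the variable-width point, here is a small sketch
using the core Encode module. The three sample characters are
arbitrary picks of mine, not anything taken from your stream:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Sample characters chosen purely for illustration.
my %samples = (
    'A'       => "A",          # U+0041, plain US-ASCII
    'e-acute' => "\x{E9}",     # U+00E9, also present in Latin-1
    'euro'    => "\x{20AC}",   # U+20AC, outside Latin-1
);

# Encode each character as UTF-8 and record how many octets it needs.
my %octets;
for my $name (sort keys %samples) {
    my $bytes = encode('UTF-8', $samples{$name});
    $octets{$name} = length $bytes;
    printf "%-8s %d octet(s): %s\n", $name, $octets{$name},
        join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}
```

Every unit is 8 bits wide, but the number of units per character
differs, which is why a byte count and a character count can
disagree once a decoding layer is in place.
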

It sounds to me like this is an ASCII or Latin1 stream of text
which potentially contains binary data (whether that is
otherwise encoded text or image data or anything else) which is
embedded into the ASCII stream as yEnc encoded data (which is
ASCII compatible). I'm assuming that the protocol itself is not
yEncoded, and only parts of the data clearly delimited by the
protocol will be yEncoded. That shouldn't be affecting your
ability to find the termination characters.

> > You can attach an IO layer to the file handle by passing an
> > additional argument to binmode:
> >
> >     binmode $fh, ':encoding(UTF-8)';
>
> Loads, and LOADS and *piles* of UTF-8 errors...
>
> utf8 "\xD826" does not map to Unicode at test.pl line 40 (#1)
> utf8 "\x1583F9" does not map to Unicode at test.pl line 40 (#1)
> etc.

That was just an example of how to specify an encoding layer for
the handle. It only makes sense to specify UTF-8 if you know the
stream is encoded using UTF-8... You can't really process it
unless you know the encoding. It sounds to me like it should be
US-ASCII, but I don't know. Do you? Can you tell us specifically
which protocol this is or direct us to any documentation or
RFCs?

> From personal experience and using other (nasty) methods and
> components for doing what I -should- be able to do with native
> perl, I've learned the hard way that messing with binmode $fh,
> ":encoding...." generally corrupts the 8-bit (yEnc) data.
> Again, I am more than likely doing it incorrectly, but I'm
> really trying to understand how to do it correctly though :-)

"binmode($fh)" should never corrupt the data. It will preserve
the data byte-for-byte. "binmode($fh, ':encoding(...)')" will
only corrupt the data if you specify the wrong encoding. If the
data is supposed to be UTF-8, but you want to tolerate errors,
you should use :utf8 instead. It should not affect yEnc at all.
However, that's assuming the stream is in fact UTF-8 encoded. I
suspect that it's not.
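
Here is a rough sketch of the difference between a raw handle
and a wrongly chosen decoding. The payload is made up (a few
high-bit octets standing in for yEnc data), and an in-memory
handle stands in for your socket so that it runs anywhere:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical payload: ASCII text plus octets that are not valid UTF-8.
my $wire = "=ybegin\r\n\xD8\x26\xF9\r\n";

# 1. A raw handle, which is what a plain binmode($fh) gives you:
#    the bytes come out exactly as they went in.
open my $raw, '<:raw', \$wire or die "open: $!";
my $raw_data = do { local $/; <$raw> };
close $raw;

# 2. Decoding the same octets as UTF-8. By default Encode replaces
#    malformed sequences with U+FFFD, so the original bytes are lost.
my $utf8_roundtrip = encode('UTF-8', decode('UTF-8', $wire));

# 3. Decoding as Latin-1 maps every octet to one character, so a
#    round trip gives the original bytes back.
my $latin1_roundtrip = encode('ISO-8859-1', decode('ISO-8859-1', $wire));

printf "raw identical:      %s\n", $raw_data         eq $wire ? 'yes' : 'no';
printf "UTF-8 round trip:   %s\n", $utf8_roundtrip   eq $wire ? 'yes' : 'no';
printf "Latin-1 round trip: %s\n", $latin1_roundtrip eq $wire ? 'yes' : 'no';
```

None of this tells you what your stream's encoding actually is.
It only shows why guessing UTF-8 for data that isn't UTF-8
produces the kind of damage you described.
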
If it's pure US-ASCII (nothing above 127) then decoding as UTF-8
would work anyway, but it's wrong. If it happens to be latin1 or
another ASCII extension then decoding as UTF-8 could indeed
corrupt it.

> But isn't that exactly why we set things like autoflush(1) or
> $|=1?  After the data stream has been sent from the server
> (i.e. CRLF.CRLF) the server stops transmitting data and waits
> for the next command, so there's no chance that a second data
> stream may be received by the client socket, at least not until
> the client socket issues a new command.

Keep in mind that when reading from a socket you're not reading
over a direct wire to the data. There's an entire network of
devices that the data has to travel through to arrive at your
machine. Lots can happen on the network. Packets can be dropped,
routed wrong, etc. The packets get reassembled in order once
they arrive on your machine and only then can you read the data
from them. There's no way to ensure that all of the data is
there when you go to read. All you can do is read and hope.
There's no guarantee the data will ever get to you. A robust
program will be prepared to deal with that.

One thing to note is that each read() overwrites the buffer, so
you're throwing away the earlier partial reads. If the stream
contains more than 512 bytes of data then you'll keep only the
most recent chunk and lose everything before it. If you want to
slurp up the entire thing into memory and process it all at once
then perhaps pass the current length of the buffer as the OFFSET
(fourth) argument to read(), indicating that it should store the
read data at the end of the buffer instead of overwriting it.

> I realize my problem here is the really whacky way in which the
> data stream is encoded (and that is completely out of my
> control).  But there must be a adequate and proper way to
> handle this data.

I don't think the data stream is encoded overly whacky. I think
the problem is just that writing computer software is *hard*.
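
Going back to the read() point, here is a minimal sketch of
accumulating chunks until the terminator shows up. The sub name,
the chunk size, and the in-memory handle are all stand-ins of
mine. I haven't seen your real loop, so treat this as an
illustration of the OFFSET argument and nothing more:

```perl
use strict;
use warnings;

# Read from $fh in fixed-size chunks, appending each chunk to the
# buffer, until $terminator appears or the handle reaches EOF.
# Returns the buffer, or nothing if EOF came before the terminator.
sub read_until_terminator {
    my ($fh, $terminator, $chunk_size) = @_;
    my $buffer = '';
    while (1) {
        # OFFSET = current length, so new data lands at the end of
        # the buffer instead of replacing what is already there.
        my $got = read($fh, $buffer, $chunk_size, length($buffer));
        die "read failed: $!" unless defined $got;
        return $buffer if index($buffer, $terminator) >= 0;
        return if $got == 0;    # EOF without seeing the terminator
    }
}

# Demo on an in-memory handle with a deliberately tiny chunk size,
# so the terminator ends up split across two reads.
my $stream = "line one\r\nline two\r\n.\r\n";
open my $fh, '<:raw', \$stream or die "open: $!";
my $message = read_until_terminator($fh, "\r\n.\r\n", 8);
close $fh;

print defined $message
    ? "got " . length($message) . " bytes\n"
    : "terminator not found\n";
```

On a live socket, sysread() accepts the same OFFSET argument and
returns as soon as some data is available rather than trying to
fill the whole chunk. That may matter when the server goes quiet
after CRLF.CRLF, but I'm guessing at your setup there.
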
It takes a bit of experience with something new to wrap your
head around it, and until then even things that you know how to
do can go wrong.

It would be helpful if you could show us a complete program that
we could test for ourselves. If you can't share the whole thing
then try creating a minimal example. We will figure this
out...!

Regards,


--
Brandon McCaig <bamcc...@gmail.com> <bamcc...@castopulence.org>
Castopulence Software <https://www.castopulence.org/>
Blog <http://www.bambams.ca/>
perl -E '$_=q{V zrna gur orfg jvgu jung V fnl. }.
q{Vg qbrfa'\''g nyjnlf fbhaq gung jnl.};
tr/A-Ma-mN-Zn-z/N-Zn-zA-Ma-m/;say'