?readLines ... given the large size of the files, you may need to process them 
in chunks by passing an open file connection rather than a character string 
file name, and using the "n" argument to limit how many lines each call reads. 
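
As a rough sketch of that chunked approach (the helper name process_in_chunks, 
the chunk size, and the demo file are my own inventions for illustration, not 
anything from your data), something along these lines should work:

```r
# Hypothetical helper: apply FUN to successive chunks of lines from a text file.
process_in_chunks <- function(path, FUN, chunk_size = 100000L) {
  con <- file(path, open = "r")   # an already-open connection keeps its position
  on.exit(close(con))
  total <- 0                      # double, so very large line counts do not overflow
  repeat {
    lines <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(lines) == 0L) break   # nothing left to read
    FUN(lines)
    total <- total + length(lines)
  }
  invisible(total)
}

# Small demonstration on a temporary file (made-up data)
tmp <- tempfile(fileext = ".txt")
writeLines(sprintf("row%03d", 1:25), tmp)
n_read <- process_in_chunks(tmp,
                            function(x) cat(length(x), "lines in this chunk\n"),
                            chunk_size = 10L)
```

The connection has to be opened before the first readLines call. If you pass a 
file name, or a connection that is not yet open, each call starts again from 
the top of the file.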

?grepl

?Extract

?tools::showNonASCII
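
Putting those three together, here is a minimal sketch of how one chunk of 
lines might be split into valid and invalid records. The layout is entirely 
assumed: I made up a record width of 10 bytes with a zero-padded numeric field 
in columns 1-5, and the sample lines are invented. Substitute your real 
specification.

```r
# ASSUMED layout (hypothetical): 10-byte records, digits in columns 1-5.
record_width <- 10L

lines <- c("00123ABCDE",            # valid
           "0012XABCDE",            # letter in the numeric field
           "00456FGHIJ00789KLMNO",  # two records run together
           "00321caf\u00e9Z")       # contains a non-ASCII character

# useBytes = TRUE and type = "bytes" avoid errors on invalid multibyte input
is_ascii  <- !grepl("[^\\x20-\\x7E]", lines, perl = TRUE, useBytes = TRUE)
right_len <- nchar(lines, type = "bytes") == record_width
num_ok    <- grepl("^[0-9]{5}", lines, perl = TRUE, useBytes = TRUE)

valid <- is_ascii & right_len & num_ok

# Logical indexing (see ?Extract) separates the two groups
good <- lines[valid]
bad  <- lines[!valid]

# Print the rejected lines that contain non-ASCII characters
tools::showNonASCII(bad)

# Write each group to its own file; useBytes = TRUE avoids re-encoding
good_file <- tempfile(fileext = ".txt")
bad_file  <- tempfile(fileext = ".txt")
writeLines(good, good_file, useBytes = TRUE)
writeLines(bad,  bad_file,  useBytes = TRUE)
```

For a large file you would run this on each chunk and write to output 
connections opened once before the loop, so that every chunk is appended 
rather than overwriting the previous one.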

There are many ways for data to be corrupted... in particular, when invalid 
characters appear the possibilities explode, so more specifics are needed if 
this is not enough. Of course, reading the Posting Guide, posting in plain 
text to avoid HTML corruption, and giving reproducible examples will improve 
the quality of responses to such questions. 

-- 
Sent from my phone. Please excuse my brevity.

On November 6, 2016 5:36:46 AM PST, Lucas Ferreira Mation 
<lucasmat...@gmail.com> wrote:
>I have some large .txt files of about 100GB containing a dataset in
>fixed-width format. The data contain some errors:
>- alphabetic characters in columns that are supposed to be numeric,
>- invalid characters,
>- rows with too many characters, possibly due to invalid characters or a
>missing end-of-line character (so two rows in the original data become
>one row in the .txt file).
>
>The errors are not very frequent, but they stop me from importing with
>readr::read_fwf().
>
>
>Is there some package, or workflow, in R to pre-process the files,
>separating the valid from the invalid rows into different files? This
>can be done with point-and-click ETL tools, such as Pentaho PDI. Is there
>some equivalent code in R to do this?
>
>I googled it and could not find a solution. I also asked this on
>StackOverflow and got no answer:
><http://stackoverflow.com/questions/39414886/fix-errors-in-csv-and-fwf-files-corrupted-characters-when-importing-to-r>
>
>regards
>Lucas Mation
>IPEA - Brasil
>
>       [[alternative HTML version deleted]]
>
>______________________________________________
>R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide
>http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.

