Well, we're apparently still carrying around old archived Kettle code which hasn't been ported to Hop yet. I'm in favor of getting rid of it since it's still available elsewhere. Same goes for the old samples. That should clear out most of the ignored code and files.
https://issues.apache.org/jira/browse/HOP-2335 : Remove archive-samples
https://issues.apache.org/jira/browse/HOP-2336 : Remove the archive-pipeline-transforms folder

On the other hand we'll be building up integration tests since we do want to do things better than before. These tests will indeed use very old FoxPro files to check if these .dbf files are still being read as they should. You'd be surprised how many of those are still around.

https://issues.apache.org/jira/browse/HOP-2325 : .properties files
https://issues.apache.org/jira/browse/HOP-2326 : .sh and .bat files
https://issues.apache.org/jira/browse/HOP-2327 : .xml files

Those 3 cover over 4000 files so that's that. So your logic makes a lot of sense. We'll continue to exclude files like SVG and indeed Hop Pipelines and Workflows (all XML variants but considered binary files).

Cheers,
Matt

On Sun, Dec 20, 2020 at 7:32 PM Julian Hyde <[email protected]> wrote:

> On Dec 20, 2020, at 1:18 AM, Matt Casters <[email protected]> wrote:
> >
> > Thank you very much Julian.
> > I mainly wonder where on earth that font comes from since we're not using it anywhere.
>
> Yeah, fonts have a habit of sneaking in. :)
>
> > As for RAT exclusions: are there any particular file formats besides .java files that need an Apache license header? We'd be happy to add them elsewhere.
> > The shell scripts perhaps as they support comments? We could even add them to the SVG files even though it will probably blow up memory consumption unless we code the comments out of the file loads somehow.
> > Perhaps it's easier to just look at other projects and ask which files need a header?
>
> My preference is to put a header on pretty much any file that can have a header. Which in my experience is pretty much all text files, except those used as test inputs or reference logs. For example, in .md files you can add the header inside comments that do not appear in the generated HTML.
> Shell scripts, pom files, properties files, etc. all support comments, so we should add headers.
>
> I agree, I would not put a header on SVG files because they are treated as de facto binaries and they need to be small.
>
> I suggest that for 0.60 we pare down the RAT exclusions to the absolute minimum. RAT is a powerful tool if we are not holding it back! I ran RAT with the -debug flag and I saw lots of Java files being excluded, and that was concerning.
>
> Binary files are always a problem. They are just as susceptible to copyright and licensing issues but are more difficult to audit. One strategy is to audit them one by one and add an exclusion line for each individual file. I know that’s a big task, so definitely not for 0.50.
>
> By the way, I ran a command to find out what kinds of files are in Hop. The results are interesting. There’s even one FoxPro file in there!:
>
> $ git ls-files -z | xargs -0 file -b | sort | uniq -c
> 2827 ASCII text
>    9 ASCII text, with CRLF, LF line terminators
>   47 ASCII text, with CRLF line terminators
>    3 ASCII text, with CR line terminators
>   16 ASCII text, with no line terminators
>  428 ASCII text, with very long lines
>    2 Big-endian UTF-16 Unicode text, with no line terminators
>    7 Bourne-Again shell script, ASCII text executable
>    1 Bourne-Again shell script, ASCII text executable, with very long lines
>    2 bzip2 compressed data, block size = 900k
>    2 Composite Document File V2 Document, Little Endian, Os: Windows, Version 10.0, Code page: 1252, Author: Matthias Hietland Heie, Last Saved By: Sergio Ribeiro, Name of Creating Application: Microsoft Excel, Create Time/Date: Fri Nov 17 14:48:53 2017, Last Saved Time/Date: Tue Jun 18 09:34:04 2019, Security: 0
>    2 Composite Document File V2 Document, Little Endian, Os: Windows, Version 10.0, Code page: 1252, Author: Sergio Ribeiro, Last Saved By: Sergio Ribeiro, Name of Creating Application: Microsoft Excel, Create Time/Date: Tue Sep 11 09:41:24 2018, Last Saved Time/Date: Tue Sep 11 10:20:56 2018, Security: 0
>    2 Composite Document File V2 Document, Little Endian, Os: Windows, Version 10.0, Code page: 1252, Author: Sergio Ribeiro, Last Saved By: Sergio Ribeiro, Name of Creating Application: Microsoft Excel, Create Time/Date: Tue Sep 11 09:41:24 2018, Last Saved Time/Date: Tue Sep 11 10:55:49 2018, Security: 0
>    2 Composite Document File V2 Document, Little Endian, Os: Windows, Version 1.0, Code page: -535, Author: JB, Revision Number: 3, Total Editing Time: 02:08, Create Time/Date: Thu Oct 27 19:46:23 2011, Last Saved Time/Date: Thu Feb 20 09:00:44 2014
>    2 Composite Document File V2 Document, Little Endian, Os: Windows, Version 5.0, Code page: 0
>    1 Composite Document File V2 Document, Little Endian, Os: Windows, Version 5.0, Code page: 1252, Author: Jens Bleuel, Last Saved By: Jens Bleuel, Name of Creating Application: Microsoft Excel, Create Time/Date: Wed Aug 23 15:46:56 2006, Last Saved Time/Date: Wed Aug 23 15:56:14 2006, Security: 0
>    1 Composite Document File V2 Document, Little Endian, Os: Windows, Version 5.1, Code page: 1252, Author: Matt Casters, Last Saved By: Matt Casters, Name of Creating Application: Microsoft Excel, Create Time/Date: Tue Sep 7 16:08:18 2010, Last Saved Time/Date: Tue Sep 7 16:15:32 2010, Security: 0
>    2 Composite Document File V2 Document, Little Endian, Os: Windows, Version 5.1, Code page: 1252, Last Saved By: Jens Bleuel, Name of Creating Application: Microsoft Excel, Create Time/Date: Thu Oct 17 06:27:31 1996, Last Saved Time/Date: Tue Nov 28 15:07:48 2006, Security: 0
>    5 C source, ASCII text
>    7 C++ source, ASCII text
>   25 CSV text
>    1 data
>    3 DOS batch file, ASCII text
>    1 Embedded OpenType (EOT), icomoon family
>    1 Embedded OpenType (EOT), OpenSansLight family
>    1 Embedded OpenType (EOT), OpenSansRegular family
>   28 empty
>    9 exported SGML document, ASCII text
>    1 FoxBase+/dBase III DBF, 279 records * 52, update-date 106-7-25, codepage ID=0xf, at offset 161 1st record " 1das ist doch keine leistung 44.00hw * 2Meister 48"
>    2 GIF image data, version 89a, 16 x 16
>    1 GIF image data, version 89a, 9 x 9
>    1 gzip compressed data, from FAT filesystem (MS-DOS, OS/2, NT), original size modulo 2^32 703
>    2 gzip compressed data, was "default.csv", last modified: Wed Aug 26 08:50:54 2015, from Unix, original size modulo 2^32 67
>   30 HTML document, ASCII text
>    1 HTML document, ASCII text, with very long lines
>    2 HTML document, UTF-8 Unicode text
>    1 ISO-8859 text
>    1 ISO-8859 text, with CR line terminators
>    3 ISO-8859 text, with very long lines
> 3179 Java source, ASCII text
>    1 Java source, ASCII text, with CRLF, LF line terminators
>    1 Java source, ASCII text, with very long lines
>   13 Java source, UTF-8 Unicode text
>    1 JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, progressive, precision 8, 400x400, components 3
>   29 JSON data
>    2 Little-endian UTF-16 Unicode text, with CRLF line terminators
>    2 Little-endian UTF-16 Unicode text, with no line terminators
>   10 Microsoft Excel 2007+
>    1 Microsoft OOXML
>    1 MS Windows icon resource - 1 icon, 32x32, 24 bits/pixel
>    2 MS Windows icon resource - 1 icon, 32x32, 32 bits/pixel
>    3 Non-ISO extended-ASCII text, with no line terminators
>    5 OpenDocument Spreadsheet
>    1 PNG image data, 1244 x 686, 8-bit/color RGB, non-interlaced
>    1 PNG image data, 1460 x 816, 8-bit/color RGB, non-interlaced
>    2 PNG image data, 15 x 15, 8-bit/color RGBA, non-interlaced
>    1 PNG image data, 1680 x 1050, 8-bit/color RGB, non-interlaced
>    3 PNG image data, 16 x 16, 8-bit/color RGBA, non-interlaced
>    2 PNG image data, 22 x 22, 8-bit/color RGB, non-interlaced
>    1 PNG image data, 403 x 138, 8-bit/color RGB, non-interlaced
>    4 PNG image data, 4702 x 1702, 8-bit/color RGB, non-interlaced
>    4 PNG image data, 5010 x 1990, 8-bit/color RGB, non-interlaced
>    1 PNG image data, 551 x 626, 8-bit/color RGB, non-interlaced
>    1 PNG image data, 642 x 368, 8-bit/color RGBA, non-interlaced
>    1 PNG image data, 972 x 464, 8-bit/color RGB, non-interlaced
>    3 ReStructuredText file, ASCII text
>    1 ReStructuredText file, ASCII text, with very long lines
>    2 SAS
>  654 SVG Scalable Vector Graphics image
>    1 TIFF image data, big-endian, direntries=16, height=16, bps=0, compression=none, PhotometricIntepretation=RGB, orientation=upper-left, width=16
>    1 TrueType Font data, 11 tables, 1st "OS/2", 14 names, Macintosh, type 1 string, icomoon
>    1 TrueType Font data, 18 tables, 1st "FFTM", 26 names, Macintosh
>    1 TrueType Font data, 18 tables, 1st "FFTM", 30 names, Macintosh
>    2 Unicode text, UTF-32, big-endian
>    2 Unicode text, UTF-32, little-endian
>  385 UTF-8 Unicode text
>    2 UTF-8 Unicode text, with no line terminators
>   40 UTF-8 Unicode text, with very long lines
>    2 UTF-8 Unicode (with BOM) text, with no line terminators
>    1 Visual FoxPro DBF, 2 records * 205, update-date 15-10-20, at offset 129 1st record "value11 "
>    1 Web Open Font Format, TrueType, length 1168, version 1.0
>    1 Web Open Font Format, TrueType, length 67528, version 1.10
>    1 Web Open Font Format, TrueType, length 69392, version 1.10
>  958 XML 1.0 document, ASCII text
>    1 XML 1.0 document, ASCII text, with CRLF, LF line terminators
>   82 XML 1.0 document, ASCII text, with very long lines
>    1 XML 1.0 document, ASCII text, with very long lines, with no line terminators
>    1 XML 1.0 document, UTF-8 Unicode text
>    2 XML 1.0 document, UTF-8 Unicode text, with very long lines
>    1 XML 1.0 document, UTF-8 Unicode (with BOM) text
>    2 Zip data (MIME type "application/vnd.pentaho.reporting.classic"?)
>
> Julian
>

--
Neo4j Chief Solutions Architect
✉ [email protected]
☎ +32486972937
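
For the .properties, .sh/.bat and .xml tickets mentioned in this thread, a quick way to see which files still lack a header might look like the sketch below. It is not part of RAT or of the Hop build. The function name `files_missing_header` is made up for illustration, and the marker string assumes the standard ASF source header, which begins with "Licensed to the Apache Software Foundation".

```shell
# Sketch only: read file names from stdin, one per line, and print those
# that do not contain the ASF header marker.
# The function name and the marker string are assumptions for illustration.
files_missing_header() {
  marker='Licensed to the Apache Software Foundation'
  while IFS= read -r f; do
    # -F matches the marker literally; -q makes grep silent and
    # exit non-zero when the marker is absent from the file.
    grep -qF -- "$marker" "$f" || printf '%s\n' "$f"
  done
}

# Possible use from the repository root (paths containing newlines or
# other unusual characters are not handled by this line-based sketch):
#   git ls-files -- '*.properties' '*.sh' '*.bat' '*.xml' | files_missing_header
```

RAT remains the authoritative check. Something like this is only handy for a rough count before and after adding headers.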
